Abstract

This dissertation explores multiscale discriminant basis selection, as well as the improvement
of classification reliability through context-dependent integration of soft decisions.
These  methods  are  applied  to  texture  and  radar  signature  classification,  document  image 
segmentation,  and  human  face  recognition. 
Preface

A  successful  pattern  recognition  scheme  starts  with  efficient  extraction  of 
the  most  discriminant  information  elements  from  various,  possibly  imprecise, 
sources,  followed  by  an  intelligent  combination  of  this  information  in  a  context- 
dependent  framework  of  low  complexity. 

Conventional multiscale basis selection and feature extraction based on
compression- and approximation-based criteria are not necessarily the best approaches
for classification and segmentation purposes. Instead, a class-separability-based
approach is preferable. In this dissertation, we explore methodologies
for lower-dimensional adaptive multi-scale discriminant basis selection. Depending
on the task, these methodologies are applied to local windows or to the whole
pattern. Our tools in this analysis are derived from theories of wavelet packets
and multi-scale local bases on the one hand, and from the statistical theory of
discriminant cluster analysis on the other hand. The goal is to find efficient
multi-scale representations that yield maximum between-class separations and
minimum within-class scatters.

We also investigate the effectiveness of soft decisions in representing the
vagueness, uncertainty and imprecision of the classification sources. Based on
the principle of least commitment in pattern recognition design and on consensus-theoretical
concepts, we try to improve the reliability of our classification system
through integration of soft decisions obtained from various observations and/or
sources. The combination of decisions is based on the discrimination power of
each source and its relevance to the current observation. We use ideas from
consensus theory, fuzzy neural learning, and evidential reasoning.

Our methods of multi-scale local/global basis selection and context-dependent
decision integration are applied in several different domains, including texture
and document image classification and segmentation, radar signature classification,
and human face recognition. The results show that superior or highly
competitive performance can be obtained using small feature sets and simple
classifiers. The resulting systems are typically of low complexity and, since no
iterative computations are involved, most of the calculations can be done in
parallel. The proposed ideas can be extended in several directions and can be
applied to many pattern recognition and segmentation tasks.
Chapter 1

Introduction 

1.1  Introduction 

Pattern  recognition  is  the  study  of  theories  and  algorithms  for  automating  the 
process  of  recognition  through  efficient  representation  of  relevant  information 
and  its  analysis  using  intelligent  schemes.  The  success  of  pattern  recognition 
systems  depends  not  only  on  the  power  of  the  data  processing  algorithm,  but 
also  on  the  proper  representation  of  input  data  so  that  all  the  salient  aspects 
of data for the specific task at hand are captured and utilized while all the irrelevant
information is discarded. With a poor knowledge representation, even
a  powerful  and  sophisticated  algorithm  may  give  inferior  results.  Improvement 
in  efficiency  of  data  representation  may  achieve  more  benefit  with  less  effort. 
Another fact which is sometimes overlooked is the significance of managing intermediate
results and decisions, in terms of representing or saving them in
the  right  format  so  that  a  minimum  amount  of  information  is  lost  as  far  as 
end-to-end  performance  is  concerned.  One  of  the  most  important  principles  in 
designing  pattern  classification  schemes  is  the  principle  of  least  commitment, 
stated by Marr [59], which simply says “don’t do something that may later
have to be undone”. This principle is consistent with utilizing soft decisions as
intermediate  results  and  carrying  them  along  until  a  crisp  decision  is  required. 

A  general  schematic  of  a  context-dependent  classification/recognition  process 
is  shown  in  Figure  1.1.  The  process  starts  with  making  a  set  of  observations 
that  can  be  ordered  in  time  or  space,  possibly  as  results  of  windowing.  In 
some applications only a few observations may be available. The first step
is to find an effective and appropriate representation of the signal or image
which, based on a given criterion, represents only the most relevant information
in a compact form. The economy of clues in human recognition, and the
fact that classification systems with small numbers of parameters generalize
better, are computationally more cost-effective, and can be trained
and adapted faster, are motivations for efficient feature extraction techniques.

Figure 1.1: Context-dependent classification and recognition using
decision integration.

Feature  extraction  can  also  be  thought  of  as,  or  be  replaced  by,  measurements 
from  various  sources.  These  sources  in  general  may  be  imprecise  with  certain 
levels  of  reliability  or  significance.  Based  on  a  consensus  rule  and  an  objective 
performance  measure  one  needs  to  combine  information  or  decisions  provided 
by  various  sources/features  to  obtain  more  reliable  performance.  This  requires 
an  objective  evaluation  of  decisions  obtained  from  individual  sources  in  terms 
of  their  impreciseness,  uncertainty,  or  reliability. 

Also attached to the concept of decision integration is the idea of incorporating
context in the final decision. The context can be defined and used
in a temporal or spatial sense or as any additional side information. Utilizing
spatial/temporal context requires another level of decision integration using decisions
obtained in a “neighborhood” around current observations based on their
objectively defined relevance to the current data or decision.

In  this  thesis  we  investigate  general  methods  of  pattern  classification  through 
adaptive  multi-scale  local  basis  design.  We  also  investigate  the  improvements 
obtained  as  a  result  of  using  soft  local  decisions  as  opposed  to  hard  decisions  and 
we  link  this  idea  to  fuzzy  neural  networks,  soft  decision  integration  and  context- 
dependent  evidential  reasoning.  Our  methodology  is  based  on  the  following 
objective  and  observations. 

Objective:  The  objective  of  this  study  is  to  develop  a  fairly  general  pattern 
and  signal  classification  and  segmentation  scheme  that  is  highly  robust  to  signal 
and  pattern  distortions  and  can  provide  competitive  results  with  low  complexity. 
Our main applications of interest are document image processing, texture analysis,
radar target classification, and face recognition. We will test our proposed
schemes  on  these  tasks  and  compare  our  results  with  other  methods. 

Observations:  There  are  several  primary  observations  that  can  lead  us 
toward  a  reasonable  approach  to  achieving  the  above  objective: 

1. Best Representations for Approximation or Discrimination: The
best and most compact representation of a set of signals for compression or
approximation purposes may not be appropriate for classifying them [2]. For
discrimination purposes, instead of the description length, entropy, or rate distortion,
the criterion should be class separability. In other words, we should seek
a small-dimensional representation space with maximum discrimination power.
In a feature space with high discrimination power, within each class the feature
points show small variation but feature points from different classes are highly
separated. The most discriminating features may or may not correspond to high
energy content or to major principal components of the signal(s).

2. Multi-scale Representation and Classification: Multi-scale signal/pattern
representations and multi-scale classification have been found to be very effective
in many signal processing applications ranging from signal compression
and coding systems to pattern recognition schemes. Motivated by the success
and plausibility of wavelets in classification systems, we will study appropriate
choices of local basis functions for the detection, classification or segmentation
of signals and images. The goal is to build, from a library of modulated waveforms,
the best set of discriminant basis functions relative to which the given
collection of signals shows the largest class separability, which in turn results in
simple and efficient algorithms for classification.

3. Ambiguity/Impreciseness in Local Decisions: In many signal/image
classification  and  segmentation  applications,  because  of  a  variety  of  constraints 
we  have  to  base  our  decisions  on  local,  incomplete  or  noisy  views  of  the  desired 
pattern.  Therefore,  there  is  usually  some  ambiguity,  imprecision  or  fuzziness 
associated  with  our  local  decisions.  In  some  applications  the  fuzziness  is  inherent 
to  the  problem  and  is  not  necessarily  due  to  noisy  or  incomplete  data.  In  such 
cases  no  hard  decision  can  be  accurate,  and  therefore  it  is  more  appropriate  to 
use  soft  decisions  to  reflect  mixed  memberships.  Expressing  all  initial  decisions 
using real-valued soft decision vectors, one has to find a way to reduce their
uncertainty  and  reach  an  acceptable  confidence  level  using  a  set  of  consensus 
rules. 

4. Incorporating Context Information through Decision Integration:
The effective use of context information in human perception is one of the key
sources of its strength. But using context information in computer vision and
pattern recognition efficiently is not a trivial problem. Many pattern recognition
schemes attempt to identify relevant context information from different sources
and incorporate it in their final decisions. The improvement due to the use of
context information becomes more noticeable when primary decisions are imprecise
due to a local/incomplete view of the patterns. The integration of soft
local decisions over a “context area” within and across scales can reduce the
level of uncertainty or increase the confidence of final decisions.
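
As a rough illustration of observations 3 and 4, the following sketch fuses soft decision vectors from neighboring windows by a relevance-weighted average. The averaging rule, the function name `integrate_soft_decisions`, and the membership and weight values are assumptions made for illustration only; they are not the consensus rule developed later in this dissertation.

```python
import numpy as np

def integrate_soft_decisions(soft, relevance):
    """Fuse soft decision vectors with relevance weights (illustrative rule).

    soft:      (n_observations, n_classes) class memberships in [0, 1]
    relevance: (n_observations,) non-negative weights expressing how relevant
               each observation is to the current one (hypothetical values)
    Returns one soft decision vector whose entries sum to 1.
    """
    soft = np.asarray(soft, dtype=float)
    w = np.asarray(relevance, dtype=float)
    combined = w @ soft / w.sum()      # relevance-weighted average
    return combined / combined.sum()   # renormalize the memberships

# Three local windows, two classes. The current window (first row) is
# ambiguous; its two neighbors are more confident.
local = [[0.55, 0.45],
         [0.80, 0.20],
         [0.75, 0.25]]
weights = [1.0, 0.5, 0.5]              # assumed relevance of each window

fused = integrate_soft_decisions(local, weights)
label = int(np.argmax(fused))          # a crisp decision is made only at the end
print(fused, label)
```

In this toy case the fused membership of the first class is higher than the ambiguous local value, which is the qualitative effect described above: the crisp decision is deferred until the context has been taken into account.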

The organization of this dissertation is as follows. In the next two chapters
we discuss some new ideas for pattern recognition in a general and analytical
form. In the following three chapters we investigate the results of applying these
ideas to various signal and image processing tasks.

Chapter 2 discusses best local/global discriminant basis selection and feature
extraction. In this chapter we study multi-scale discriminant feature extraction
for classification and segmentation purposes. These discriminant bases
should be designed so that maximum separability of clusters in the feature space
can be achieved using small-dimensional feature vectors. Depending on the task
one may look for best features based on local windows on the signal or image, or
for global features using all the data points in a signal/pattern. For segmentation
tasks, for acceptable localization of region boundaries, one has to use small
local  windows  and  the  challenge  is  to  obtain  consistent  results  with  a  limited 
and  sometimes  insufficient  view  of  the  signal/pattern.  On  the  other  hand,  for 
recognition  and  classification  of  objects  with  major  macroscopic  structure,  in 
order  to  capture  all  the  geometrical  relationships  between  object  components, 
one  may  need  to  view  the  signal  as  a  whole. 

Chapter 3 focuses on the utilization of soft decisions and context-dependent
decision  integration  rules.  This  chapter  discusses  methods  of  finding  a  consensus 
among  a  set  of  experts  or  imprecise  information  sources  with  different  levels 
of  reliability.  Integration  of  spatial/temporal  context  information  based  on  a 
relevance  function  is  also  discussed.  The  ideas  presented  in  this  chapter  are  based 
on  evidential  reasoning  and  similarity-based  fuzzy  decision  systems  discussed  in 
the  literature.  After  discussing  our  analytical  proposals  we  test  them  on  a  variety 
of  applications. 

Chapter  4  presents  results  of  applying  multiscale  discriminant  analysis  to 
some  real  1-D  and  2-D  signal  classification  problems.  To  test  our  method  of 
classifying 1-D signals we use a set of low-resolution radar signatures for automatic
target recognition. Then we investigate the effectiveness of applying
similar  methods  to  classification  and  segmentation  of  2-D  patterns/images,  for 
which  we  use  a  set  of  texture  images.  The  results  of  these  analyses  are  compared 
to  those  using  existing  wavelet-based  classification  systems. 

Chapter 5 treats layout-independent document page segmentation using adaptive
multiscale discriminant features. We present an algorithm for layout-independent
document page segmentation based on document texture which makes use of
multiscale feature vectors and fuzzy local decision information to overcome the
shortcomings of previous segmentation approaches when applied to complex documents.
Multiscale feature vectors, computed using a wavelet packet tree which
is designed based on document domain specific information, are classified locally
using a neural network to allow soft/fuzzy multi-class membership assignments.

Chapter 6 focuses on analysis and recognition of human faces. In this chapter
the discriminatory power of various human facial features is studied and a
new scheme for Automatic Face Recognition (AFR) is proposed. The first part
of the chapter focuses on the Linear Discriminant Analysis (LDA) of different
aspects of human faces in the spatial as well as wavelet domains. This analysis
allows us to objectively evaluate the significance of visual information in
different parts/features of the face for identifying the human subject.

The  LDA  of  faces  also  provides  us  with  a  small  set  of  features  that  carry  the 
most  relevant  information  for  classification  purposes.  The  features  are  obtained 
through  eigenvector  analysis  of  scatter  matrices  with  the  objective  of  maximizing 
between-class  and  minimizing  within-class  variations.  The  result  is  an  efficient 
projection-based  feature  extraction  and  classification  scheme  for  AFR.  Although 
all  of  the  face  recognition  experiments  in  this  section  are  performed  at  a  single 
scale,  the  underlying  LDA-based  feature  extraction  ideas  can  also  be  applied  to 
wavelet  decompositions. 
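
The eigenvector analysis of scatter matrices mentioned above can be sketched as follows. This is a minimal textbook form of LDA on synthetic two-dimensional data, not the face recognition system of Chapter 6; the function name `lda_projection`, the data, and the random seed are assumptions. The data are built so that the high-energy direction carries no class information, which also illustrates the earlier remark that the most discriminating features need not be the major principal components.

```python
import numpy as np

def lda_projection(X, y, n_features):
    """Return the leading eigenvectors of Sw^{-1} Sb as columns.

    Sw is the within-class scatter matrix and Sb the between-class scatter
    matrix; the projection favors large between-class and small within-class
    variation.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    d = X.shape[1]
    overall_mean = X.mean(axis=0)
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - overall_mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(evals.real)[::-1]
    return evecs.real[:, order[:n_features]]

# Synthetic data: axis 0 has large variance but no class information,
# axis 1 has small variance but separates the two classes.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.normal(0.0, 5.0, size=2 * n),
    np.concatenate([rng.normal(-1.0, 0.3, n), rng.normal(1.0, 0.3, n)]),
])
labels = np.repeat([0, 1], n)

w = lda_projection(X, labels, 1)[:, 0]          # discriminant direction
pc = np.linalg.eigh(np.cov(X.T))[1][:, -1]      # first principal component
print("LDA direction:", w, "PCA direction:", pc)
```

Under these assumptions the discriminant direction lies mostly along the low-variance axis, while the first principal component follows the high-variance axis.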

1.2  Summary  of  Contributions 

This  dissertation  reports  the  following  new  contributions  ranging  from  analytical 
results  to  new  applications: 

• Discriminant Local Basis Design: The design of local bases for best discrimination
performance using separability criteria; the application of separability
measures for best basis selection or composition from an orthogonal or
redundant dictionary of local waveforms.

•  Context-Dependent  Multisource  Soft  Decision  Integration:  Utilization 
of  a  consensus  rule  that  integrates  soft  decisions  based  on  their  discrimination 
power and exploits spatial/temporal context using a corresponding relevance
criterion. 

• Wavelet Packet Based Layout Independent Document Page Segmentation:
The application of texture-based adaptive multiscale features using
wavelet packets along with context-dependent soft decision integration for segmentation
of document pages with complex layouts.

• Discriminant Analysis and Recognition of Human Faces: The application
of linear discriminant analysis to objective analysis of human facial features
and automatic face recognition using a simple projection-based method that utilizes
separability measures for feature extraction and multisource soft decision
integration.

Besides these contributions, competitive results in automatic radar target recognition
and texture segmentation using very small multiscale feature sets are also
presented.
Chapter 2

Multi-scale Discriminant Features

2.1  Introduction 

Classification  of  patterns  as  performed  by  humans  is  usually  based  on  a  small 
number  of  important  attributes  which  often  have  multi-scale  organizations.  In 
practice we are usually confronted with pattern recognition tasks where physically
or logically relevant information is not sufficiently well defined and understood.
In such applications there is a need to devise algorithmic approaches to
finding  and  evaluating  a  set  of  multi-scale  classification  attributes  that  show  the 
maximum  discriminatory  potential  in  a  small-dimensional  feature  space. 

Recently  the  application  of  wavelets  and  multi-rate  filter  banks  [70,  51]  to 
multi-scale  feature  extraction  has  received  significant  attention.  Wavelet-based 
features have been shown to be efficient representations for detection, classification
and segmentation of 1-D signals, e.g. speech, music, and other acoustic
or radar transients [51, 26, 25]. Successful texture and image analysis schemes
based on wavelet or Gabor transforms have also been proposed [44, 13, 82]. Examples
of texture and image segmentation using wavelet packets are given in
[55,  29].  In  addition  to  engineering  tests,  the  evidence  that  some  multi-resolution 
and  spatial  frequency  analysis  is  performed  by  our  visual  and  auditory  systems, 
demonstrated  by  psychophysical  studies  [83,  93],  shows  the  biological  plausibility 
of  wavelet-based  methods. 

Motivated by the success and plausibility of wavelet-based classification systems,
in the first part of this chapter we review the basic methodologies for
local basis selection found in the literature. We then present our proposed
discrimination-based signal decomposition scheme. Our objective is to build,
from  a  library  of  modulated  waveforms,  a  set  of  suitable  basis  functions  relative 
to  which  the  given  collection  of  signals  shows  the  largest  class  separability,  which 
in  turn  results  in  simple  and  efficient  algorithms  for  classification.  We  investigate 
appropriate  algorithms  for  both  orthogonal  bases  and  redundant  dictionaries  of 
local  functions. 

Since  classification  systems  with  small  numbers  of  parameters  provide  better 
generalization  and  adaptation  performance  at  lower  computational  cost  [34],  we 
are  interested  in  dimensionality  reduction  techniques.  It  is  usually  advantageous 
to  sacrifice  some  information  in  order  to  keep  the  number  of  system  parameters 
to  a  minimum.  With  this  observation  and  our  suggested  basis  selection  idea,  we 
also  study  the  issue  of  optimal  extraction  of  low-dimensional  feature  vectors  from 
multi-scale  decompositions  of  signals.  Our  approach  focuses  on  the  exploitation 
of  class-specific  differences  obtained  through  inspection  of  a  pre-defined  class 
separation  [34,  24]  attainable  from  the  multiscale  decomposition,  and  on  finding 
a  linear  map  that  provides  the  smallest  set  of  features  relative  to  which  the 
given  collection  of  signals  shows  the  largest  class  separability.  This  in  turn 
results  in  simple  and  efficient  classification  schemes.  Although  most  of  our 
discussions  are  about  wavelet  packet  bases,  the  suggested  basis  selection  method 
can  be  applied  to  other  tree-structured  local  basis  functions,  e.g.  libraries  of 
local  sine/cosine  functions  [19],  and  also  to  other  tasks  such  as  classification 
of  acoustic  transients  and  biomedical  and  satellite  images.  It  is  shown  that 
simple  search  techniques  can  be  devised  if  the  basis  functions  are  orthogonal 
and  can  be  put  in  a  tree  structure.  The  multi-scale  dimensionality  reduction 
idea  can  be  used  for  both  orthogonal  and  non-orthogonal  libraries  of  local  basis 
functions,  e.g.  local  sine/cosine  functions,  Gabor  functions,  and  even  composite 
and redundant basis libraries [57].

The  idea  of  designing  local  bases  using  class  separability  criteria  has  been 
studied  concurrently  by  Saito  and  Coifman  [72].  Their  proposed  algorithm  is 
based  on  local  energy  features  and  additive  discrimination  costs  and  their  tests 
are  based  on  synthetic  data.  Part  of  this  dissertation  reports  the  results  of 
similar  but  independent  research  which  is  applicable  to  non-additive  costs,  non- 
orthogonal  bases,  and  arbitrary  local  features.  Also,  in  the  next  few  chapters  the 
results  of  tests  on  several  real  signal  and  image  classification  and  segmentation 
tasks  are  provided.  Our  approach  to  adaptive  multi-scale  local  basis  design  is 
general  in  the  sense  of  its  applicability  to  different  signal  and  image  classification 
tasks. 

The  organization  of  this  chapter  is  as  follows: 

In  Section  2.2,  a  brief  introduction  to  multi-scale  signal  representations  with 
emphasis  on  wavelet  packets  is  given.  Several  known  [34,  24]  measures  of  class 
separability  are  summarized  and  a  new  separability-based  tree-structured  local 
basis  design  is  suggested  in  Section  2.3.  Section  2.4  describes  a  related  but 
independent  idea  of  dimensionality  reduction  of  multi-scale  features,  followed 
by  its  extension  to  redundant  and  non-orthogonal  basis  dictionaries,  in  Section 
2.5.  Some  comments  on  multi-scale  and  context-dependent  classification  and 
segmentation  are  provided  in  Section  2.6. 

2.2  Multi-scale  Signal  Representations 

The  optimal  representation  of  signals  in  the  time-frequency  plane  (or  the  so- 
called  Phase  Plane  [19,  56])  is  an  active  area  of  research,  where  the  optimality  is 
a task-dependent issue. In most time-frequency decompositions, signals are projected
onto a set of waveforms or time-frequency atoms [57]. A general family
of time-frequency atoms can be generated by scaling, translating and modulating
a single window function $g(t) \in L^2(\mathbb{R})$, where $g(t)$ is a real, continuously
differentiable and $O(1/(t^2+1))$ function satisfying

$$\|g\| = 1, \quad \int g(t)\,dt \neq 0, \quad g(0) \neq 0. \qquad (2.2.1)$$

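Therefore any element of the dictionary is of the form

$$g_\gamma(t) = \frac{1}{\sqrt{s}}\, g\!\left(\frac{t-u}{s}\right) e^{i\xi t} \qquad (2.2.2)$$

and can be identified by the triple $\gamma = (s, \xi, u) \in \Gamma = \mathbb{R}^+ \times \mathbb{R}^2$, where $s$, $\xi$
and $u$ represent scaling, modulation, and translation factors, respectively [57].
These waveforms form a dictionary

$$D = \{ g_\gamma(t) : \gamma \in \Gamma \} \qquad (2.2.3)$$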

of  basis  functions  which  may  or  may  not  be  orthogonal  or  even  complete  and 
may  or  may  not  have  a  tree  structure.  A  function/signal  is  decomposed  in  a 
dictionary $D$ by its projections onto the elements of $D$. The waveforms $g_\gamma$
must be selected adaptively based on the local properties of the desired signals,
so that the expansion coefficients provide the desired information most “efficiently”.
The best decomposition strategy also depends on the characteristics
of  the  dictionary. 
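
To make the atom construction concrete, the sketch below generates one scaled, translated and modulated atom from a unit-norm Gaussian window and checks numerically that the scaling by $1/\sqrt{s}$ preserves the unit norm. The Gaussian window, the parameter values and the function names are assumptions chosen for illustration; the text does not fix a particular window.

```python
import numpy as np

def gaussian_window(t):
    """Unit-norm Gaussian window g(t) = 2^(1/4) exp(-pi t^2) (assumed choice)."""
    return 2.0 ** 0.25 * np.exp(-np.pi * t ** 2)

def atom(t, s, xi, u):
    """Time-frequency atom: window scaled by s, shifted by u, modulated by xi."""
    return gaussian_window((t - u) / s) * np.exp(1j * xi * t) / np.sqrt(s)

def coefficient(signal, g, dt):
    """Projection of a sampled signal onto an atom (Riemann-sum inner product)."""
    return np.sum(signal * np.conj(g)) * dt

t = np.linspace(-40.0, 40.0, 80001)
dt = t[1] - t[0]
g = atom(t, s=3.0, xi=5.0, u=2.0)

norm = np.sqrt(np.sum(np.abs(g) ** 2) * dt)    # should be close to 1
self_projection = coefficient(g, g, dt)         # atom projected onto itself
print(norm, abs(self_projection))
```

The Gaussian satisfies the three conditions on the window: it has unit norm, a non-zero integral, and a non-zero value at the origin.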

The smallest possible dictionary is a basis of the signal space, but general dictionaries
are redundant families of waveforms/vectors. Examples of orthogonal bases
are Wavelet Packet (WP) and Local Trigonometric Basis (LTB) functions; see
Figure 2.1. On the other hand, the general family of Gabor functions forms a
redundant dictionary of bases. In the following we review the theory of best
signal decompositions using tree-structured local bases, where we focus on WP
bases. Also, we review the concepts of best decompositions in the framework of
redundant dictionaries. We then extend the idea of best approximation-based
multiscale representation to finding the most discriminatory representations for
classification purposes.

Figure 2.1: Partitioning of the phase plane for (a) wavelet packet
basis (adaptive windowing along the frequency axis); (b) local trigonometric
basis (adaptive windowing along the time/space axis).


2.2.1 Multiscale Orthogonal Bases: Wavelets

Wavelet  transforms  [56,  22]  and  their  generalized  form,  called  wavelet  packets, 
provide  signal  analysis  through  smooth  partitioning  of  the  frequency  axis.  The 
waveforms  in  WT  and  WP  dictionaries  have  a  tree  structure  and  they  form  an 
orthonormal basis for $L^2(\mathbb{R})$.

We begin with an exact Quadrature Mirror Filter (QMF) [19] $h(n)$ satisfying

$$\sum_n h(n-2k)\,h(n-2l) = \delta_{kl} \quad \text{and} \quad \sum_n h(n) = \sqrt{2}. \qquad (2.2.4)$$

Let $g(k) = (-1)^k h(1-k)$ and define the mappings $F_i$ from $\ell^2(\mathbb{Z})$ onto
“$\ell^2(2\mathbb{Z})$”:

$$F_0\{s\}(i) = 2\sum_k s(k)\,h(k-2i), \qquad F_1\{s\}(i) = 2\sum_k s(k)\,g(k-2i), \qquad (2.2.5)$$
k 
which can be considered as convolutions followed by down-sampling operations.
The map $F(s) = F_0(s) \oplus F_1(s) \in \ell^2(2\mathbb{Z}) \oplus \ell^2(2\mathbb{Z})$ is orthogonal, and satisfies
the alias cancellation and perfect reconstruction conditions

$$F_0 F_0^* = F_1 F_1^* = I \qquad (2.2.6)$$

$$F_1 F_0^* = F_0 F_1^* = 0 \qquad (2.2.7)$$

$$F_0^* F_0 + F_1^* F_1 = I \qquad (2.2.8)$$

where $F_0^*$ and $F_1^*$ are the adjoint (i.e. upsampling and anticonvolution) operations
corresponding to $F_0$ and $F_1$ respectively. This mapping is the basic block
of all wavelet transform and wavelet packet trees [19]. Application of $F$ to each
node/subband $s$ projects $s$ onto two orthogonal subspaces $F_0(s)$ and $F_1(s)$ which
correspond to the smoothed version of $s$ and the remaining details respectively.
Thus each node in the tree represents a subspace of its parent’s space and each
subspace is the orthogonal direct sum of its two children. The functions $h$ and
$g$ represent the low-pass and high-pass filters, respectively. Also $H$ and $G$ are
the frequency responses of the corresponding 1-D filters, used in the filter bank
implementation of the system, shown in Figure 2.2. In this figure $V$ and $W$ are
orthogonal subspaces generated at each level of decomposition.
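
Conditions (2.2.6) to (2.2.8) can be checked numerically on a small example. The sketch below builds the two operators as matrices for the Haar filter on a periodic signal of length 8. Several choices here are assumptions: the Haar taps $h(0) = h(1) = 1/\sqrt{2}$, the high-pass convention $g(k) = (-1)^k h(1-k)$, circular boundary handling, and the orthonormal normalization, in which the scalar factor in front of the sums in (2.2.5) is dropped so that the identities hold exactly.

```python
import numpy as np

def analysis_matrix(taps, n):
    """Matrix M with M[i, k] = f(k - 2i): filtering followed by down-sampling.

    taps maps a filter index to its value; indices wrap around circularly
    (an assumed boundary treatment for a finite signal of even length n).
    """
    M = np.zeros((n // 2, n))
    for i in range(n // 2):
        for index, value in taps.items():
            M[i, (index + 2 * i) % n] += value
    return M

r = 1.0 / np.sqrt(2.0)
h = {0: r, 1: r}                                   # Haar low-pass taps
g = {k: (-1) ** k * h[1 - k] for k in (0, 1)}      # high-pass: g(k) = (-1)^k h(1-k)

N = 8
F0 = analysis_matrix(h, N)
F1 = analysis_matrix(g, N)

same_space = np.allclose(F0 @ F0.T, np.eye(N // 2)) and np.allclose(F1 @ F1.T, np.eye(N // 2))
cross_terms = np.allclose(F1 @ F0.T, 0.0) and np.allclose(F0 @ F1.T, 0.0)
reconstruction = np.allclose(F0.T @ F0 + F1.T @ F1, np.eye(N))
print(same_space, cross_terms, reconstruction)

# A constant signal passes through the low-pass branch only.
constant = np.ones(N)
print(F0 @ constant, F1 @ constant)
```

For real filters the adjoint is the matrix transpose, so `F0.T` and `F1.T` play the role of the upsampling and anticonvolution operators.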

Figure 2.2: Computing the pyramidal wavelet transform by applying
the $F_0$ and $F_1$ operations and multirate filtering. The filter-bank
implementation (left); the tree structure (right).

In the wavelet transform the decomposition process is iterated on the low-frequency
component and at each iteration the high-frequency coefficients are
retained intact. These iterations result in a pyramidal tree structure, which allows
signal analysis by dyadically partitioning its spectrum more and more finely
toward the low frequency regions. While for many classes of applications and
signals this pyramidal multiresolution representation is appropriate, for others it
becomes restrictive. For many classes of signals, e.g. textures, document images,
and many acoustic signals, where a major part of the energy or “information”
lies in the mid to upper frequency ranges, the pyramidal wavelet transform is
not suitable because, regardless of the spectral characteristics of the signal, it
only allows finer and finer resolutions toward the lower frequency bands [29, 55].
On the other hand, fast wavelet packet analysis algorithms permit us to perform
adaptive Fourier windowing of a signal by an optimal and smooth partitioning
of the frequency axis.
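
The structural difference between the pyramidal transform and a wavelet packet tree can be sketched in a few lines. The example assumes the Haar filter pair and a signal whose length is a power of two; the function names and the node labelling are illustrative, and the nodes are indexed in natural (filter-bank) order rather than frequency order. Only the full tree is built here; selecting a best sub-tree is a separate step.

```python
import numpy as np

R = 1.0 / np.sqrt(2.0)

def split(s):
    """One Haar analysis step: (low-pass, high-pass) halves of s."""
    s = np.asarray(s, dtype=float)
    return R * (s[0::2] + s[1::2]), R * (s[0::2] - s[1::2])

def wavelet_transform(s, levels):
    """Pyramidal transform: only the low-pass branch is split again.

    Returns [high_1, high_2, ..., high_L, low_L].
    """
    bands = []
    for _ in range(levels):
        s, high = split(s)
        bands.append(high)
    bands.append(s)
    return bands

def wavelet_packet_tree(s, levels):
    """Full wavelet packet tree: every node is split into two children.

    Returns a dict mapping (depth, node_index) to the node's coefficients.
    """
    tree = {(0, 0): np.asarray(s, dtype=float)}
    for depth in range(levels):
        for n in range(2 ** depth):
            low, high = split(tree[(depth, n)])
            tree[(depth + 1, 2 * n)] = low
            tree[(depth + 1, 2 * n + 1)] = high
    return tree

rng = np.random.default_rng(1)
x = rng.standard_normal(16)

bands = wavelet_transform(x, 3)
tree = wavelet_packet_tree(x, 3)
leaves = [tree[(3, n)] for n in range(8)]

print([len(b) for b in bands])      # band sizes shrink toward low frequencies
print([len(c) for c in leaves])     # uniform split at the deepest level
```

Because each Haar split is orthonormal, the signal energy is preserved both by the pyramidal bands and by the set of nodes at any single depth of the packet tree.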

Define the following sequence of functions:

$$W_{2n}(x) = \sqrt{2}\sum_k h(k)\,W_n(2x-k), \qquad W_{2n+1}(x) = \sqrt{2}\sum_k g(k)\,W_n(2x-k). \qquad (2.2.9)$$

A Wavelet Packet Basis of $L^2(\mathbb{R})$ is any orthonormal basis selected from the
functions $2^{k/2}\,W_n(2^k x - j)$ [19]. The three parameters $\{k, n, j\}$ have physical
interpretations of scale, frequency (or sequency), and position, respectively. Thus
each library of Wavelet Packet (WP) bases can be organized as a subset of a
full binary tree. In WP analysis both low- and high-frequency components of
the signal can be decomposed at each iteration, and thus the corresponding WP
tree can grow in different directions.

Figure 2.3: Filter bank structure for computing wavelet packets
(top). An example of energy-based non-uniform subband decomposition
(bottom).

Wavelet packet expansions correspond algorithmically to adaptive subband
decompositions of signals and images using multi-rate filter banks which are
widely used in signal compression systems [89, 2, 68]; see Figure 2.3. In most
applications of wavelet analysis to multi-dimensional signals and images, for
simplicity, the signal space is assumed to be a separable Hilbert space and in
filter bank implementations of 2-D wavelet packets separable filters along the
row and column directions are used, i.e.

$$H_{ll}(\omega_x, \omega_y) = H(\omega_x) \cdot H(\omega_y), \qquad H_{lh}(\omega_x, \omega_y) = H(\omega_x) \cdot G(\omega_y)$$
$$H_{hl}(\omega_x, \omega_y) = G(\omega_x) \cdot H(\omega_y), \qquad H_{hh}(\omega_x, \omega_y) = G(\omega_x) \cdot G(\omega_y) \qquad (2.2.10)$$

where $H$ and $G$ are the 1-D low-pass and high-pass filters respectively, defined
above, and the first and second subscripts show the low-pass or high-pass
characteristics of the filters in the row and column directions. Figures 2.4 and 2.5
show examples of the filter-bank implementations of the 2-D separable WT and
WP, respectively.

Figure 2.4: 2-D wavelet transform: the partitioning of the 2-D spectrum,
the pyramid structure of the tree, and the filter-bank implementation.

Figure 2.5: Multirate separable filter bank structure for computing
two-dimensional wavelet packets (left); a 2-D wavelet packet tree,
where each node can be decomposed into four child nodes (right).
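
To make the tree structure concrete, the following minimal Python sketch builds a full 1-D wavelet packet tree. It is an illustration only: the two-tap Haar filter pair, the function names `haar_split` and `wavelet_packet_tree`, and the natural (not frequency-sorted) ordering of node indices are assumptions, and any other QMF pair could be substituted. Splitting only node index 0 at each level would give the pyramidal wavelet transform instead.

```python
import math

def haar_split(x):
    """Split x into low- and high-pass subbands (Haar filters, downsampled by 2)."""
    s = math.sqrt(2.0)
    low = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return low, high

def wavelet_packet_tree(x, depth):
    """Return {(level, index): coefficients} for the full binary WP tree."""
    tree = {(0, 0): list(x)}
    for level in range(depth):
        for index in range(2 ** level):
            # Both the low- and the high-frequency child are decomposed further.
            low, high = haar_split(tree[(level, index)])
            tree[(level + 1, 2 * index)] = low
            tree[(level + 1, 2 * index + 1)] = high
    return tree

# The signal length must be divisible by 2**depth.
signal = [4.0, 2.0, 5.0, 5.0, 1.0, 3.0, 0.0, 2.0]
tree = wavelet_packet_tree(signal, depth=2)
# The four level-2 nodes form an orthonormal basis, so energy is preserved.
leaf_energy = sum(c * c for i in range(4) for c in tree[(2, i)])
```

Because every level of the tree is an orthonormal expansion of the same signal, keeping all nodes is redundant; a basis corresponds to a set of nodes that covers the frequency axis exactly once.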

Another way of optimally representing signals in the time-frequency plane is
to perform adaptive smooth partitioning of the temporal/spatial axis, as
illustrated in Figure 2.1. This leads to the “dual” or “conjugate” of wavelet packets,
the so-called “Local Trigonometric Basis” (LTB) functions [19]. It can be shown
that it is possible to partition the real line into disjoint intervals smoothly and
construct orthonormal bases on each interval. Dyadically partitioning the time
axis forms a binary tree of local bases that can be adaptively designed to
optimize a predefined cost function. This idea has been studied in [19] in the context
of best local basis selection using an entropy criterion for compression purposes.

Wavelet  Packet  and  Local  Trigonometric  Basis  functions  are  examples  of 
tree-structured  orthogonal  basis  functions.  In  the  following  we  focus  on  wavelet 
packets  but  most  of  the  results  hold  equivalently  for  LTB’s  or  any  other  tree 
structured  local  basis.  In  particular,  all  results  and  algorithms  are  applicable  to 
M-ary  wavelet  packet  trees  [19].  Since  keeping  all  the  coefficients  in  a  WP  tree 
leaves  us  with  a  redundant  set,  one  asks  about  the  optimal  tree  structure  for  a 
given  task.  In  other  words  the  flexibility  of  a  WP  tree  enables  us  to  form  the 
WP  tree  based  on  a  given  task-dependent  criterion. 

In  designing  wavelet  packet  trees,  one  either  takes  a  divide  and  conquer 
approach,  starting  from  the  most  refined  sub-space  decomposition  and  moving 
upward  in  the  tree  by  merging  “adjacent”  nodes  “appropriately”,  or  starts  from 
the  root  and  performs  iterative  decomposition  of  each  node  into  its  subspaces 
if  this  is  “appropriate”.  In  either  case  the  “appropriate”  choice  is  based  on  a 
pre-selected  task-dependent  criterion. 

Let  us  consider  the  first  approach.  Using  the  fact  that  in  wavelet  packet 
trees,  at  each  level,  subspaces  are  orthogonal,  and  considering  the  redundancy 
between  a  parent  node  and  its  children  nodes,  one  can  evaluate  the  pre-selected 
cost  function  for  the  parent  node  and  for  the  combination  of  its  children,  and 
by  comparing  the  two  values  decide  whether  to  retain  the  parent  node  or  the 
children.  Continuing  this  test  for  all  nodes  and  levels  provides  the  tree  structure 
appropriate  for  the  specific  task  based  on  the  pre-selected  criterion.  The  depth 
of  the  tree  is  limited  by  complexity  and  other  considerations. 

Depending on the specific application, different criteria can be used to build
the optimal wavelet packet tree. Coifman and Wickerhauser [19] have suggested
the use of “entropy” as a measure of energy spread among the transform coefficients.
Let $H$ be a Hilbert space. Let $s \in H$, $\|s\| = 1$, and let $H = \bigoplus_i H_i$ be an
orthogonal decomposition of $H$. They define

$$\varepsilon^2(s, \{H_i\}) = -\sum_i \|s_i\|^2 \ln \|s_i\|^2 \qquad (2.2.11)$$

where $s_i$ is the orthogonal projection of $s$ onto $H_i$, as the entropy of $s$ relative
to the decomposition $\{H_i\}$ of $H$, a measure of the distance between $s$ and the
orthogonal decomposition. For example, in the LST library case, one compares
the entropy of the expansions in two adjacent windows to the entropy of the
expansion in their union and picks the smaller one, continuing the comparison
with the selection made for the next pair, etc. The tree of basis functions is thus
built so that maximum energy compaction among the fewest coefficients is
obtained. The “best basis” paradigm therefore permits a rapid (e.g. $O(N \log N)$)
search among a large collection of tree-structured orthogonal bases to find the
most compact representation.
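
The parent-versus-children comparison behind the entropy criterion can be sketched as follows. This is a minimal illustration, not the implementation of [19]: the Haar filter pair, the recursive formulation, and the names `entropy_cost` and `best_basis` are assumptions. Coefficient energies are normalized by the total signal energy so that the cost is additive over subbands, and the convention $0 \ln 0 = 0$ is used.

```python
import math

def haar_split(x):
    """One level of a two-channel Haar filter bank (downsampled by 2)."""
    s = math.sqrt(2.0)
    low = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return low, high

def entropy_cost(coeffs, total_energy):
    """Additive entropy -sum p ln p with p = c^2 / total_energy."""
    cost = 0.0
    for c in coeffs:
        p = c * c / total_energy
        if p > 0.0:
            cost -= p * math.log(p)
    return cost

def best_basis(x, total_energy, max_depth):
    """Return (cost, leaf coefficient blocks) of the minimum-entropy basis."""
    parent_cost = entropy_cost(x, total_energy)
    if max_depth == 0 or len(x) < 2:
        return parent_cost, [list(x)]
    low, high = haar_split(x)
    low_cost, low_leaves = best_basis(low, total_energy, max_depth - 1)
    high_cost, high_leaves = best_basis(high, total_energy, max_depth - 1)
    # Keep the children only if together they are cheaper than the parent.
    if low_cost + high_cost < parent_cost:
        return low_cost + high_cost, low_leaves + high_leaves
    return parent_cost, [list(x)]

flat = [1.0, 1.0, 1.0, 1.0]
cost, leaves = best_basis(flat, total_energy=4.0, max_depth=2)
```

For the constant signal the energy ends up in a single low-pass coefficient, so the selected basis has a much lower entropy than the original samples, whose energy is spread evenly.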

For signal compression applications, Vetterli et al. [68] suggest the
minimization of the rate-distortion function [20] as a criterion for basis tree
selection. This criterion is a compromise between description length and distortion
in a compression scheme such as vector quantization. The WP tree is designed
to minimize this function. This criterion seems to be appropriate for signal
compression and coding applications.

Also, for signal analysis and classification problems, dominance of energy
concentrations in subbands

$$E_i = \frac{1}{N} \sum_n |X_i[n] - \bar{X}_i|^2 \qquad (2.2.12)$$

has been used as a criterion for further decomposition [51, 13], and the “Energy
Map” is used as a feature set. The idea behind this approach is the assumption
that the most interesting features come from high-energy components of
the signal. Figure 2.3 also illustrates the idea of energy-based wavelet packet
decomposition.

2.2.2  Redundant  Dictionaries 

Although a signal can be completely characterized by its decomposition on an
orthogonal basis, any such basis may not be rich enough to represent all
potentially interesting microstructures. As in human language, a limited dictionary of
words may suffice for expressing any idea, using composite words and sentences;
but utilizing a more extensive dictionary enables one to find more compact and
efficient ways of expressing ideas. There is an infinite number of ways to
decompose a signal/image over a redundant dictionary of waveforms. In fact, it can be
shown that in a finite-dimensional space, computing the optimal expansion of
signals using a redundant dictionary of waveforms is an NP-complete problem.
This justifies the use of suboptimal greedy algorithms, such as the
approximation-based greedy algorithm called matching pursuit [57]. The problem is to
find the optimal, i.e. most compact, decomposition of a signal $f$ over a
dictionary of normalized waveforms/vectors $D = \{g_\gamma\}_{\gamma \in \Gamma}$ whose linear combinations
are dense in the signal space $H$. Matching Pursuit is a greedy algorithm that
successively approximates a signal $f$ with orthogonal projections on elements of
$D$.

Let $g_{\gamma_0} \in D$. The vector/signal $f$ can be decomposed into

$$f = \langle f, g_{\gamma_0} \rangle\, g_{\gamma_0} + Rf \qquad (2.2.13)$$

where $Rf$ is the residual vector after approximating $f$ in the direction of $g_{\gamma_0}$. The
iterative approximation is performed by successive selection of the dictionary
element closest to the decomposition residue at each step and computing the
new residual term. Let $R^0 f = f$ and assume that at the $k$th iteration, $R^k f$ has
already been computed. We choose $g_{\gamma_k} \in D$ that best matches the residue $R^k f$,

$$|\langle R^k f, g_{\gamma_k} \rangle| = \sup_{\gamma \in \Gamma} |\langle R^k f, g_\gamma \rangle| \qquad (2.2.14)$$

and then project $R^k f$ onto $g_{\gamma_k}$:

$$R^{k+1} f = R^k f - \langle R^k f, g_{\gamma_k} \rangle\, g_{\gamma_k} \qquad (2.2.15)$$

which defines the residue of the $(k+1)$th order. The orthogonality of $R^{k+1} f$ and
$g_{\gamma_k}$ implies

$$\|R^{k+1} f\|^2 = \|R^k f\|^2 - |\langle R^k f, g_{\gamma_k} \rangle|^2 \qquad (2.2.16)$$

$$f = \sum_{k=0}^{n-1} \langle R^k f, g_{\gamma_k} \rangle\, g_{\gamma_k} + R^n f \qquad (2.2.17)$$

Thus $R^n f$ is the approximation error of $f$ after $n$ iterations. In fact the original
objective was to minimize this error for a fixed $n$. As part of our discriminant
analysis we will revisit this idea and exploit it for best discrimination
performance. Details about fast numerical computation of the matching pursuit
algorithm and its orthogonal version can be found in [57].
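
The iteration (2.2.14) and (2.2.15) can be sketched in a few lines of Python. This is a plain, unoptimized illustration for finite-dimensional vectors, not the fast implementation referred to in [57]; the function name `matching_pursuit` and the small three-atom dictionary are assumptions chosen for the example.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matching_pursuit(f, dictionary, n_iter):
    """Greedy expansion of f over unit-norm atoms; returns (terms, residual)."""
    residual = list(f)
    terms = []  # one (atom index, coefficient) pair per iteration
    for _ in range(n_iter):
        # (2.2.14): pick the atom with the largest |<R^k f, g>|.
        products = [dot(residual, g) for g in dictionary]
        best = max(range(len(dictionary)), key=lambda i: abs(products[i]))
        coeff = products[best]
        # (2.2.15): subtract the projection to obtain the next residue.
        residual = [r - coeff * g for r, g in zip(residual, dictionary[best])]
        terms.append((best, coeff))
    return terms, residual

s = 1.0 / math.sqrt(2.0)
atoms = [[1.0, 0.0], [0.0, 1.0], [s, s]]  # redundant: three unit vectors in R^2
f = [3.0, 1.0]
terms, residual = matching_pursuit(f, atoms, n_iter=2)
```

Summing (2.2.16) over the iterations gives an energy balance: the squared norm of `f` equals the sum of the squared coefficients plus the squared norm of the final residual, which is a convenient sanity check for any implementation.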

2.3  Discriminant  Local  Basis 

Most of the proposed basis selection algorithms are tailored to provide compact
representations and effective signal compression. However, for classification
purposes a criterion based on the difference between the patterns/signals of different
classes, i.e. class separability, is preferable [28], because one may observe
relatively high energy subbands on which the desired signals are quite similar and
subbands of relatively low average energies that contain significant information
about the differences between the signals. Moreover, the average energy
and second central moments of the subbands may not be the only or the best feature
set for classification. For example, higher-order moments may be used as part of
a feature set, and in such cases the decision criterion for further decomposition
at each level should also take those features into consideration.

One of the main ideas of this study is to investigate the effectiveness of a
separability or discrimination-based criterion for local basis selection. The
process of analysis compares projections of a set of signals onto waveforms of a
pre-selected library and picks the projections that contain the most
discriminatory information. This selection permits discrimination of signals to a specified
accuracy with the fewest waveforms. The tree structure selected based on class
separability may not be optimal or even sub-optimal for representing or
approximating individual signals, and it does not even need to provide a “complete”
basis, as is required for some other tasks, e.g. compression, identification, and
modeling.

In  the  following  we  first  review  the  basic  ideas  of  class  separability  and  its 
measures  and  then  use  those  measures  as  our  criteria  for  basis  selection  from  an 
orthogonal  or  redundant  dictionary  of  waveforms. 

2.3.1  Class  Separability  Measures 

In  order  to  design  an  efficient  classification  system  one  has  to  select  features 
that  are  most  effective  in  showing  the  salient  differences  between  the  signals,  so 
that  signal  clusters  are  well  separated  in  the  feature  space. 

Consider a collection of $N$ signals from $L$ different but known classes.

Feature extraction is a mapping from a high- and possibly infinite-dimensional
signal space to a typically low-dimensional feature space:

$$\Gamma : s \in S \mapsto v \in \mathbb{R}^n \qquad (2.3.18)$$

The training set $T$ is a set of prelabeled observations

$$T_h = \{(v_i, l_i) : i = 1, \ldots, N \ \text{and}\ l_i \in \{1, 2, \ldots, L\}\} \quad \text{or} \qquad (2.3.19)$$
$$T_s = \{(v_i, l_i) : i = 1, \ldots, N \ \text{and}\ l_i \in [0, 1]^L\}$$

where $T_h$ and $T_s$ correspond to training based on hard decisions or soft decisions
respectively. While a hard decision is a single label assignment, a soft decision is
in the form of a real-valued vector where each component of this vector represents
the closeness, degree of membership, or similarity between an observation and
the prelabeled observations in the training set.

For best feature extraction or evaluation we need a means of quantifying
the distance or separability of the clusters corresponding to different classes.
Let $P(V|C_l)$ be the conditional density function which represents the spread
of feature points $v \in V$ for each class $C_l$ defined in the feature space. For $L$
different classes we are seeking a measure of the distance or separation among
the $L$ clusters represented by the $P(V|C_l)$'s:

$$\mathrm{Sep}(V, C) = d\big(P(V|C_1), P(V|C_2), \ldots, P(V|C_L)\big) \qquad (2.3.20)$$

Examples of quantitative measures of class separability (CS) are Bayes error,
variational distance, scatter matrix based measures, Bhattacharyya distance
and divergence rate [34, 24]. Bayes error is the best measure of separability of
distributions and for any selection of features it gives the minimum amount of
attainable classification error. For a two-class $\{C_i, C_j\}$ problem with equal prior
probabilities and uniform misclassification cost and feature vector $V$, the Bayes
error can be simplified to

$$\int_V \min\big(P(V|C_i), P(V|C_j)\big)\, dV \qquad (2.3.21)$$

where the $P(\cdot|\cdot)$'s are conditional class density functions, shown in Figure 2.6.
One attempts to minimize this error over different choices of feature vector $V$.
For multiple-class problems Bayes error can be defined similarly by the areas of
the regions where the conditional distributions overlap. Theoretically speaking
the Bayes error is the optimum measure of feature effectiveness, and despite its
computational complexity, its estimated value is a popular criterion.

Figure 2.6: Class separability: (a) Bayes error; (b) within- and
between-class scatter.
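
As a small numerical illustration of the overlap integral in (2.3.21), the sketch below compares two candidate scalar features whose class-conditional densities are assumed to be unit-variance Gaussians. The Gaussian model, the midpoint integration rule, and all function names are assumptions made for the example. Note that with equal priors of 1/2 the misclassification probability is one half of this overlap area, so ranking features by the overlap is equivalent to ranking them by error.

```python
import math

def gaussian_pdf(v, mean, std):
    z = (v - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def overlap_integral(pdf_i, pdf_j, lo, hi, steps=20000):
    """Midpoint-rule estimate of the integral of min(pdf_i, pdf_j) over [lo, hi]."""
    dv = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        v = lo + (k + 0.5) * dv
        total += min(pdf_i(v), pdf_j(v)) * dv
    return total

def overlap_for_feature(mean_i, mean_j, std=1.0):
    """Overlap of two Gaussian class densities that share the same spread."""
    lo = min(mean_i, mean_j) - 10.0 * std
    hi = max(mean_i, mean_j) + 10.0 * std
    return overlap_integral(lambda v: gaussian_pdf(v, mean_i, std),
                            lambda v: gaussian_pdf(v, mean_j, std), lo, hi)

weak = overlap_for_feature(0.0, 1.0)    # class means one standard deviation apart
strong = overlap_for_feature(0.0, 4.0)  # class means four standard deviations apart
```

The feature with the larger mean separation yields the smaller overlap and would therefore be preferred under this criterion.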

The problem with divergence rate and its symmetric variation (J-divergence)

$$\text{Divergence:} \quad J_{ij} = D\big(P(V|C_i)\,\|\,P(V|C_j)\big) = \sum_{v \in V} P(v|C_i) \log \frac{P(v|C_i)}{P(v|C_j)} \qquad (2.3.22)$$

$$\text{J-Divergence:} \quad J'_{ij} = J_{ij} + J_{ji} \qquad (2.3.23)$$

as measures of discrepancy between conditional distributions [20] is that they
do not have metric properties. Also, since divergence is defined for pairs of
distributions, when the number of classes is more than two one needs to consider
divergences for all pairs and use their summation:

$$d\big(P(V|C_1), P(V|C_2), \ldots, P(V|C_L)\big) = \sum_{i=1}^{L-1} \sum_{i < j \le L} J'_{ij} \qquad (2.3.24)$$

Despite this fact and the computational complexity, divergence-based class
separability measures are sometimes used as alternatives to Bayes error.
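
For discrete (e.g. histogram-based) class-conditional distributions, the divergence measures above reduce to short sums. The sketch below is illustrative only: it assumes the natural logarithm, assumes that every bin with nonzero probability under one class also has nonzero probability under the other, and uses made-up three-bin histograms.

```python
import math

def divergence(p, q):
    """Relative entropy D(p || q) of two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def j_divergence(p, q):
    """Symmetric J-divergence, the sum of both directed divergences."""
    return divergence(p, q) + divergence(q, p)

def total_j_divergence(class_dists):
    """Sum of J-divergences over all class pairs i < j, as in (2.3.24)."""
    total = 0.0
    for i in range(len(class_dists) - 1):
        for j in range(i + 1, len(class_dists)):
            total += j_divergence(class_dists[i], class_dists[j])
    return total

# Hypothetical class-conditional histograms of one quantized feature.
hists = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7], [0.3, 0.4, 0.3]]
separation = total_j_divergence(hists)
```

The directed divergence is not symmetric in its arguments, which is one reason it lacks metric properties; the J-divergence restores symmetry but still does not satisfy the triangle inequality in general.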

An elegant and yet simple way of formulating a criterion of class separability
is based on within- and between-class scatter matrices, which are widely used
in discriminant analysis [34]. The within-class scatter matrix ($S_w$) shows the
scatter of the sample vectors ($V$) of different classes around their respective
mean/expected vectors $M_i$:

$$S_w = \sum_{i=1}^{L} \Pr\{C = C_i\}\, \Sigma_i \qquad (2.3.25)$$

where $\Sigma_i = E[(V - M_i)(V - M_i)^T \mid C_i]$ represents the spread of feature points in
the $i$th class. Also one can define the between-class scatter matrix ($S_b$) as the
scatter of the conditional mean vectors $M_i$ around the overall mean vector $M$:

$$S_b = \sum_{i=1}^{L} \Pr\{C = C_i\}\, (M - M_i)(M - M_i)^T \qquad (2.3.26)$$

In order to have good separability for classification one needs to have “large”
between-class scatter and “small” within-class scatter simultaneously. There
are several ways of defining a positive function as a measure of this combined
separability criterion [34]:

$$J_1 = \mathrm{tr}(S_w^{-1} S_b) \qquad (2.3.27)$$

$$J_2 = \ln |S_w^{-1} S_b| \qquad (2.3.28)$$

$$J_3 = \mathrm{tr}(S_b) / \mathrm{tr}(S_w) \qquad (2.3.29)$$

In our experiments $J_1$ is used but the same results hold for $J_2$. We denote
the objective function computed over subspace $V$ by $J_V$. A similar but simplified
version of this idea has been used in speaker identification and speech recognition
problems, where it is called the “F-ratio” [33].
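
The scatter matrices and two of the criteria above can be computed directly from labeled feature vectors. The following sketch is a minimal illustration under stated assumptions: class priors are estimated by relative class frequencies, expectations are replaced by sample averages, and $J_1$ is evaluated only for 2-D features so that an explicit 2x2 inverse suffices. The function names and the toy data are hypothetical.

```python
def mean_vector(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def add_outer(acc, u, scale):
    """acc += scale * u u^T, in place."""
    for r in range(len(u)):
        for c in range(len(u)):
            acc[r][c] += scale * u[r] * u[c]

def scatter_matrices(samples_by_class):
    """Return (S_w, S_b) with priors estimated by class frequencies."""
    all_samples = [v for cls in samples_by_class for v in cls]
    dim = len(all_samples[0])
    overall = mean_vector(all_samples)
    s_w = [[0.0] * dim for _ in range(dim)]
    s_b = [[0.0] * dim for _ in range(dim)]
    for cls in samples_by_class:
        prior = len(cls) / len(all_samples)
        m = mean_vector(cls)
        for v in cls:  # sample covariance of the class, weighted by its prior
            add_outer(s_w, [a - b for a, b in zip(v, m)], prior / len(cls))
        add_outer(s_b, [a - b for a, b in zip(overall, m)], prior)
    return s_w, s_b

def j3(samples_by_class):
    s_w, s_b = scatter_matrices(samples_by_class)
    dim = len(s_w)
    return sum(s_b[i][i] for i in range(dim)) / sum(s_w[i][i] for i in range(dim))

def j1_2d(samples_by_class):
    """J_1 = tr(S_w^{-1} S_b) for 2-D features, using the explicit 2x2 inverse."""
    s_w, s_b = scatter_matrices(samples_by_class)
    (a, b), (c, d) = s_w
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    return sum(inv[r][k] * s_b[k][r] for r in range(2) for k in range(2))

square = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
near = [square, [[x + 1.0, y] for x, y in square]]
far = [square, [[x + 5.0, y] for x, y in square]]
```

Moving the second cluster farther away leaves the within-class scatter unchanged and increases the between-class scatter, so both criteria grow.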

2.3.2  Best  Wavelet  Packets  for  Discrimination 

In this section we present our WP basis selection scheme, which tries to find the
best WP tree for classification purposes. In the following we refer to each node
as a subband or feature interchangeably, although one may compute more than
one feature from each subband. The algorithm is
based on a divide and conquer approach similar to the well-known Best WP
basis selection proposed by Coifman et al. [19], but there are some important
differences.


Algorithm

1. Select an appropriate wavelet/QMF filter or local sine/cosine transform.
Call the operation $F$.

2. Let $T^{(0)}$ be the root node, let the iteration index $n = 0$, and go through the
following iterations. Each iteration involves a decision about decomposing
one node from the retained tree.

3. Perform one level of decomposition on each terminal node/subband $p$:

$$p \xrightarrow{\;F\;} \{c_i \mid i = 1, \ldots, M\} \qquad (2.3.30)$$

4. For each parent node/subband $p$ and its children nodes $\{c_i \mid i = 1, \ldots, M\}$,
compute the corresponding feature sets. These feature sets are typically
computed through simple nonlinear operations and may or may not be
based on local energies.

5. Compare the Combined Class Separability (CCS) obtained using all tree
nodes selected so far together with the parent node, $J(T^{(n)}, p)$, to the same
CCS excluding node $p$ but including all its children nodes $C$, $J(T^{(n)}, C)$:

$$T^{(n+1)} = \{T^{(n)}, p\} \quad \text{if } J(T^{(n)}, p) > J(T^{(n)}, C)$$
$$T^{(n+1)} = \{T^{(n)}, C\} \quad \text{if } J(T^{(n)}, p) < J(T^{(n)}, C) \qquad (2.3.31)$$

In other words, we decompose a node $p$ if this decomposition gives us
“additional” significant discrimination information.

6. Repeat steps 3 to 5 for the updated tree, increasing the iteration index
($n \to n+1$), until no further significant improvement of separation is observed by
decomposing the terminal nodes. One can terminate the iteration earlier
if the amount of achieved separation is larger than a preselected threshold.

7. Reduce the dimensionality of the feature vectors using a feature selection
method (e.g. Backward Elimination, Forward Selection, or Branch and
Bound) to sort the list of features in the order of their CS information
importance.
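
Steps 2 to 6 above can be sketched as a greedy top-down loop. The code below is a simplified illustration, not the implementation used in the experiments: it assumes the Haar filter pair for $F$, one feature per node (the mean squared coefficient of the subband), the criterion $J_3 = \mathrm{tr}(S_b)/\mathrm{tr}(S_w)$ of (2.3.29) in place of $J_1$ so that no matrix inverse is needed, and a hypothetical relative `margin` to express a “significant” improvement. Step 7 is omitted.

```python
import math

def haar_split(x):
    """One application of F: low- and high-pass Haar subbands (downsampled by 2)."""
    s = math.sqrt(2.0)
    low = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return [low, high]

def energy(band):
    """Node feature: mean squared coefficient of the subband."""
    return sum(c * c for c in band) / len(band)

def separability(vectors, labels):
    """J = tr(S_b) / tr(S_w), with priors taken as relative class frequencies."""
    dim, total = len(vectors[0]), len(vectors)
    overall = [sum(v[d] for v in vectors) / total for d in range(dim)]
    tr_w = tr_b = 0.0
    for label in set(labels):
        members = [v for v, l in zip(vectors, labels) if l == label]
        mean = [sum(v[d] for v in members) / len(members) for d in range(dim)]
        tr_w += sum((v[d] - mean[d]) ** 2 for v in members for d in range(dim)) / total
        tr_b += len(members) / total * sum((mean[d] - overall[d]) ** 2 for d in range(dim))
    return tr_b / (tr_w + 1e-12)  # small constant guards against division by zero

def tree_separability(nodes, labels):
    """Each node holds one subband per training signal; one feature per node."""
    vectors = [[energy(node[k]) for node in nodes] for k in range(len(labels))]
    return separability(vectors, labels)

def grow_tree(signals, labels, max_depth, margin=0.05):
    """Split a terminal node only if the combined separability clearly increases."""
    tree = [(0, list(signals))]  # (depth, subbands of all training signals)
    changed = True
    while changed:
        changed = False
        keep = tree_separability([node for _, node in tree], labels)
        for pos, (depth, node) in enumerate(tree):
            if depth >= max_depth or len(node[0]) < 2:
                continue
            split = [haar_split(band) for band in node]
            children = [(depth + 1, [pair[c] for pair in split]) for c in range(2)]
            candidate = tree[:pos] + children + tree[pos + 1:]
            grow = tree_separability([n for _, n in candidate], labels)
            if grow > (1.0 + margin) * keep:
                tree, changed = candidate, True
                break
    return tree

signals = [[1.0, 1.0, 1.0, 1.0], [1.2, 1.2, 1.2, 1.2],        # class 0: smooth
           [1.0, -1.0, 1.0, -1.0], [1.2, -1.2, 1.2, -1.2]]    # class 1: oscillating
labels = [0, 0, 1, 1]
tree = grow_tree(signals, labels, max_depth=2)
root_j = tree_separability([signals], labels)
final_j = tree_separability([node for _, node in tree], labels)
```

In this toy example the two classes have identical total energy, so the root node alone cannot separate them; the first split exposes the difference, and further splits are rejected because they do not increase the criterion by more than the margin.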

Splitting each subband increases both the within- and between-class
scatters, so it may or may not result in an increase of class separation as defined
in (2.3.27). However, since windowing is performed in the frequency domain, it
is more likely that such an increase will be observed at earlier levels of
decomposition rather than at later stages, where the subbands are too small to reliably
characterize the differences. This observation and the depth limitation described
earlier explain how the algorithm terminates.

Note that for the special case of additive separability cost, i.e.

$$J(V^{(n)}) = \sum_{i=1}^{n} J(V_i) \qquad (2.3.32)$$

(2.3.31) reduces to

$$T^{(n+1)} = \{T^{(n)}, p\} \quad \text{if } J(p) > J(C)$$
$$T^{(n+1)} = \{T^{(n)}, C\} \quad \text{if } J(p) < J(C) \qquad (2.3.33)$$

which is consistent with [72]. The choice of additive cost may not be appropriate,
especially when there is a dependency or statistical correlation between features.
For example, the combination of two features which carry significant but similar
discrimination information does not provide us with twice the discrimination
power.

Figure 2.7: Computation of feature vectors for corresponding local
windows in all subbands. (Panel title: wavelet packet based features.)

Aside from the main idea of the algorithm, one can argue about the
appropriate choice of the feature set for each node. Without claiming optimality, as a
reasonable choice, we use features based on central moments of the
corresponding subband signals, e.g.

$$\mu_n(W) = \frac{1}{|W|} \Big( \sum_{x \in W} (f(x) - f_W)^n \Big)^{1/n} \qquad (2.3.34)$$

$$V = \{ v_i = \mu_2(W_i),\ v'_i = \mu_3(W_i)/\mu_2(W_i),\ i = 0, 1, \ldots, N_{\text{subbands}} \}$$

where $W_i$ is the local window on the $i$th subband. On each subband, $f(x)$ and
$f_W$ are defined as the intensity value at location $x$ and the average intensity on
window $W$ centered at $x$ respectively, as shown in Figure 2.7. Depending on
the nature of the signal or image classification task, $W$ can be a 1-D or 2-D
window. For segmentation tasks the window slides through the signal and at
each location it covers a part of the signal, whereas in classification tasks there is
only one window covering the whole signal. For each subband or node, $\mu_2$ shows
the average energy whereas $\mu_3/\mu_2$ roughly represents the information about the
shape of the spectrum in that subband.

2.3.3  Separability  and  Dimensionality  Reduction 

In order to design a simple and efficient classification and segmentation scheme
one has to select features that are most effective in showing the salient
differences between the signals, i.e. a selection that results in the best minimal set of
features in terms of the separability of the signal clusters in the feature space.
The reduction in dimensionality of the feature vectors can be achieved either by
selecting them or combining them so that maximum classification information
is retained. We start with the selection process and then we study Linear
Discriminant Analysis as a tool for obtaining the best linear combination weights.

After  the  full  tree  of  wavelet  basis  functions  is  selected,  to  simplify  the  feature 
vector,  those  nodes  that  do  not  actively  contribute  to  the  overall  classification 
performance  can  be  discarded.  With  this  elimination  process  the  pruned  tree 
will  no  longer  correspond  to  a  “complete”  basis,  but  completeness  is  not  required 
for  analysis  and  classification  purposes. 

The simplest but most unreliable method of selecting feature subsets (of size
$m < n$) is to consider them individually and select from the top of the list of
features, sorted based on the cost for each feature alone:

$$U_m = \{u_i,\ i = 1, \ldots, m : J(u_i) \ge J(v_j)\ \ \forall v_j \in (V - U_{m-1})\} \qquad (2.3.35)$$

This selection is optimal only if the features are independent and the cost
function is additive [24]. In many applications neither is the case. On the other
hand, direct exhaustive search, even for moderate sizes of feature sets, is
computationally prohibitive. So depending on the tolerated complexity, one can use
suboptimal Forward Selection or Backward Elimination methods, or so-called
Branch and Bound search [34, 24].

Iterative comparisons can be initiated from the complete set of features ($\Omega$)
by eliminating the one that has the least effect on the overall cost and continuing
the same elimination process for the remaining set until the minimum acceptable
cost or maximum affordable number of features is obtained [34]. We call this
stepwise process Backward Elimination:

$$V^0 = \Omega = \{u_1, u_2, \ldots, u_n\} \qquad (2.3.36)$$

$$V^{k+1} = V^k - \arg\min_{u_i \in V^k} \{J(V^k) - J(V^k - \{u_i\})\} \qquad (2.3.37)$$

Also, one can start with the selection of a single feature $V^1 = \{u_1\}$ that
results in the largest cost $J(V^1)$. Then, fixing $V^1$, select from the remaining
features a $u_2$, forming $V^2 = \{V^1, \{u_2\}\}$, such that it provides the largest cost $J(V^2)$,
and continue to include the most effective combination [34]. This is called Forward
Selection:

$$V^0 = \text{Null} \qquad (2.3.38)$$

$$V^{k+1} = \{V^k,\ \arg\max_{u_i} J(V^k \cup \{u_i\}),\ u_i \in (\Omega - V^k)\} \qquad (2.3.39)$$
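
Both stepwise procedures are short to express once a subset cost $J$ is available. The sketch below is a generic illustration: `cost` is any function that maps a list of features to a number, and the toy cost (the number of distinct "class pairs" a subset can separate, stored in the hypothetical table `COVERS`) is made up to show a non-additive criterion with two redundant features.

```python
def forward_selection(features, cost, m):
    """Grow the subset from empty, adding the feature that maximizes J each step."""
    selected = []
    while len(selected) < m:
        remaining = [u for u in features if u not in selected]
        best = max(remaining, key=lambda u: cost(selected + [u]))
        selected.append(best)
    return selected

def backward_elimination(features, cost, m):
    """Shrink the full set, dropping the feature whose removal reduces J the least."""
    selected = list(features)
    while len(selected) > m:
        # Minimizing J(V) - J(V - {u}) is the same as maximizing J(V - {u}).
        drop = max(selected, key=lambda u: cost([v for v in selected if v != u]))
        selected.remove(drop)
    return selected

# Hypothetical example: features "a" and "b" carry the same information.
COVERS = {"a": {1, 2, 3}, "b": {1, 2, 3}, "c": {4}}

def toy_cost(subset):
    covered = set()
    for u in subset:
        covered |= COVERS[u]
    return len(covered)

features = ["a", "b", "c"]
individually_ranked = sorted(features, key=lambda u: toy_cost([u]), reverse=True)[:2]
forward = forward_selection(features, toy_cost, m=2)
backward = backward_elimination(features, toy_cost, m=2)
```

Ranking features individually picks the two redundant features, whereas both stepwise methods pair one of them with the weaker but complementary feature and reach a higher combined cost. Neither stepwise method is guaranteed to be optimal in general.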

One can also use variations of the so-called Branch and Bound method of
selecting the best subset of nodes/subbands [34, 24]. This approach, although
computationally more involved, can provide the optimal selection of nodes even
when there is considerable dependence among features across nodes. This
algorithm is a top-down search with backtracking which examines all possible
combinations without exhaustive search. It is based on the monotonic property
of the majority of feature selection criteria, namely for a nested set:

$$V^{(1)} \supset V^{(2)} \supset V^{(3)} \supset \cdots \qquad (2.3.40)$$

$$J(V^{(1)}) \ge J(V^{(2)}) \ge J(V^{(3)}) \ge \cdots \qquad (2.3.41)$$

Just to illustrate the basic idea of pruning, the second approach is adopted
in the following. Using $J$ as our class separability criterion the algorithm for
basis selection can be summarized as follows:

Unlike Mean Square Error (MSE), which is the most widely used criterion for
signal representation, class separability measures are typically invariant under
any non-singular, linear or non-linear, transformation. However, any singular
mapping used for dimensionality reduction results in losing some discriminating
information. Our objective is to find the mapping that for a given reduction in
space dimensionality provides the maximum class separability. In other words,
we are searching among all possible singular transformations for the best
subspace which preserves class separability as much as possible in the lowest
possible dimensional space, as illustrated in Figure 2.8. So we are seeking a linear
transformation $A$ from $\mathbb{R}^n$ to $\mathbb{R}^m$ with $m < n$ such that

$$A : X \subset \mathbb{R}^n \to Y \subset \mathbb{R}^m \qquad (2.3.42)$$

$$A = \arg\min_A \{|J_X - J_{A^T X}|\} \qquad (2.3.43)$$

where $J_X = \mathrm{tr}(S^{(X)})$ and $J_Y = \mathrm{tr}(S^{(Y)})$ are separabilities computed over the $X$ and
$Y = A^T X$ spaces respectively. Thus $A$ optimizes $J_Y$, i.e. minimizes the drop in
cost $|J_X - J_{A^T X}|$ incurred by the reduction in the feature space dimensionality.
It can be shown that for such an optimum $A$

$$\{\lambda^{(Y)}_i\} \subset \{\lambda^{(X)}_j\}, \quad i = 1, \ldots, m,\ \ j = 1, \ldots, n \qquad (2.3.44)$$

where the $\lambda^{(X)}$'s and $\lambda^{(Y)}$'s are the eigenvalues of the corresponding separation
matrices $S^{(X)}$ and $S^{(Y)}$. This observation and the fact that

$$J_Y = \mathrm{tr}(S^{(Y)}) = \sum_{i=1}^{m} \lambda^{(Y)}_i \qquad (2.3.45)$$

suggest that one can maximize (or minimize) $J_Y$ by taking the largest (or
smallest) $m$ eigenvalues of $S^{(X)}$. Following our earlier observations, and having
determined the separation matrix, we perform eigenvalue analysis of the separation
matrix on the augmented database:

$$\mathrm{eig}\{S^{(X)}\} = \{(\lambda_i, u_i),\ i = 1, \ldots, N_s - 1,\ \lambda_i \ge \lambda_{i+1}\} \qquad (2.3.46)$$

Figure 2.8: Dimensionality reduction of the feature vectors obtained
from a balanced or pruned wavelet packet tree. (Node labels: S0, S1, S2, S3.)

To reduce the computational cost for large dataset sizes one can use the following
equality [78, 34]:

$$S_b u_i = \lambda_i S_w u_i \qquad (2.3.47)$$

This shows that the $u_i$'s and $\lambda_i$'s are generalized eigenvectors and eigenvalues of
$\{S_b, S_w\}$. From this equation the $\lambda_i$'s can be computed as the roots of the
characteristic polynomial

$$|S_b - \lambda_i S_w| = 0 \qquad (2.3.48)$$

and then the $u_i$'s can be obtained by solving

$$(S_b - \lambda_i S_w) u_i = 0 \qquad (2.3.49)$$

only for the selected largest eigenvalues [78]. Note that the dimensionality $m$
of the resulting set of feature vectors is $m \le \mathrm{rank}(S) = \min(n, N_s - 1)$. Now
define

$$\Lambda^{(m)} = \{\lambda_i,\ i = 1, \ldots, m \le N_s - 1\} \qquad (2.3.50)$$

$$U^{(m)} = \{u_i,\ i = 1, \ldots, m \le N_s - 1\} \qquad (2.3.51)$$

so that $\Lambda^{(m)}$ and $U^{(m)}$ represent the set of $m$ largest eigenvalues of $S^{(X)}$ and
their corresponding eigenvectors. Considering $U^{(m)}$ as one of the possible linear
transformations $\Omega$ from $\mathbb{R}^n$ to $\mathbb{R}^m$, with $m < n$, one can show that

$$\Omega = \{U : X \subset \mathbb{R}^n \to U^T X = Y \subset \mathbb{R}^m,\ m < n\} \qquad (2.3.52)$$
$$U^{(m)} = \arg\min_{U \in \Omega} \{|J_X - J_{U^T X}|\} \qquad (2.3.53)$$

where $J_X = \mathrm{tr}(S^{(X)})$ and $J_Y = \mathrm{tr}(S^{(Y)})$ are separabilities computed over the $X$
and $Y = U^T X$ spaces respectively. This means that $U^{(m)}$ minimizes the drop
$|\mathrm{Sep}(X) - \mathrm{Sep}(U^T X)|$ in classification information incurred by the reduction in
the feature space dimensionality, and no other $\mathbb{R}^n$ to $\mathbb{R}^m$ linear mapping can
provide more separation than $U^{(m)}$ does; thus $A = U^{(m)}$.

Therefore, the optimal linear transformation from the initial representation
space in $\mathbb{R}^n$ to a low-dimensional feature space in $\mathbb{R}^m$ based on our selected
separation measure results from projecting the input vectors $x$ onto the $m$ eigenvectors
corresponding to the $m$ largest eigenvalues of the separation matrix $S^{(X)}$. These
optimal vectors/directions can be obtained from a sufficiently rich training set
and can be updated if needed. Note that the idea of multi-scale dimensionality
reduction can be applied to multi-scale classification systems regardless of the
criterion used for basis selection, e.g. it can be used on the pyramidal wavelet
transform, balanced or unbalanced wavelet packet tree, or local trigonometric
functions.
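
For a two-class problem the generalized eigenproblem (2.3.47) has a single nonzero eigenvalue, and its eigenvector is proportional to $S_w^{-1}(M_1 - M_2)$, the classical Fisher direction. The sketch below checks this numerically for 2-D features. The scatter matrix entries and the mean difference are made-up numbers, equal priors are assumed (so that $S_b$ is one quarter of the outer product of the mean difference), and the helper names are hypothetical.

```python
def mat_vec(m, v):
    return [sum(m[r][c] * v[c] for c in range(len(v))) for r in range(len(m))]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve_2x2(m, rhs):
    """Solve m x = rhs for a 2x2 matrix by Cramer's rule."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [(d * rhs[0] - b * rhs[1]) / det, (a * rhs[1] - c * rhs[0]) / det]

# Illustrative within-class scatter and class-mean difference M_1 - M_2.
s_w = [[2.0, 1.0], [1.0, 1.0]]
mean_gap = [1.0, 0.0]
# Equal priors: S_b = 0.25 * gap gap^T, a rank-one matrix.
s_b = [[0.25 * p * q for q in mean_gap] for p in mean_gap]

u = solve_2x2(s_w, mean_gap)            # u = S_w^{-1} (M_1 - M_2)
lam = 0.25 * dot(mean_gap, u)           # the only nonzero eigenvalue

left = mat_vec(s_b, u)                  # S_b u
right = [lam * x for x in mat_vec(s_w, u)]  # lambda * S_w u

# J_1 in the full 2-D space: tr(S_w^{-1} S_b), built column by column.
j_full = sum(solve_2x2(s_w, [s_b[0][c], s_b[1][c]])[c] for c in range(2))
# J_1 after projecting onto the single direction u.
j_projected = dot(u, mat_vec(s_b, u)) / dot(u, mat_vec(s_w, u))
```

Here the one-dimensional projection retains the full value of $J_1$, in agreement with the bound $m \le N_s - 1$ on the number of useful directions when $N_s = 2$.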

2.3.4  Separability  and  Redundant  Dictionaries 

Although  all  of  our  discussion  has  been  limited  to  complete  and  orthogonal 
dictionaries  of  bases,  the  idea  of  separability-based  multi-scale  basis  design  can 
also  be  applied  to  non-orthogonal  and  redundant  dictionaries.  In  particular,  if 
the  initial  multi-scale  signal  representation  is  obtained  through  linear  operations 
or  “projections”  [57],  one  can  absorb  the  matrix  A  into  these  operations.  For 
example,  if  projections  of  the  signals  onto  a  set  of  multiscale  “templates”  = 
1, ...,  n}  are  used,  then  application  of  A  to  these  templates,  {A^<pi,  i  =  1, ...,  m  < 
n},  provides  a  small  number  of  “composite  waveforms”  on  which  the  projections 
of  the  input  signals  show  the  largest  differences,  i.e. 

y  =  {Vi}  =  {<s,cf>i>}  (2.3.54) 

U  =  K}  =  AV  =  Av{J  =  {<s,A.?ii>}  (2.3.55) 

The  original  library  of  multi-scale  basis  functions  can  be  a  redundant  dictionary 
composed  of  wavelet  packet  bases,  local  sine/cosine  functions,  or  families  of 
Gabor  functions.  Also  “composite”  signals  generated  using  this  method  are  task- 
dependent  and  do  not  in  general  have  any  specific  structure  like  the  wavelet  tree 
structure.  They  can  be  stored  as  a  set  of  multi-scale  signal  templates/vectors 
to  be  used  in  signal  projection  and  feature  extraction  processes. 
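
The absorption of A into the projection step can be checked numerically. The following is a minimal sketch, assuming random stand-ins for the templates, the reduction matrix, and the signal; the names `Phi`, `A`, `s` and all sizes are illustrative and are not taken from the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 6, 2, 64                  # assumed: n templates, m < n composites, signal length d

Phi = rng.standard_normal((n, d))   # rows are hypothetical templates phi_i
A = rng.standard_normal((m, n))     # stand-in for the dimensionality reduction matrix A
s = rng.standard_normal(d)          # stand-in input signal

V = Phi @ s                         # v_i = <s, phi_i>, cf. (2.3.54)
U_two_step = A @ V                  # project onto all templates, then reduce with A

composite = A @ Phi                 # each row is one composite template
U_one_step = composite @ s          # project directly onto the m composite templates, cf. (2.3.55)

print(np.allclose(U_two_step, U_one_step))
```

Because only the m rows of `composite` need to be stored, the per-signal cost drops from n inner products to m.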

Figure 2.9: Obtaining multiscale composite templates from 1-D Gabor
functions using linear combinations corresponding to rows of the
dimensionality reduction matrix A. (Panel titles: phi1;
0.5*phi1+0.3*phi2+0.2*phi3.)

For example, if a set of Gabor functions $\Phi$ with index set $\Gamma$ is used
as the starting dictionary of basis functions

$$\varphi_\gamma(x) = \exp\left(-\frac{(x-d)^2}{2a^2}\right) \cos\bigl(2\pi f (x - d)\bigr) \tag{2.3.56}$$

$$\Phi = \{\varphi_\gamma\}_{\gamma \in \Gamma} \quad \text{where} \quad \Gamma = \{\gamma\} = \{(a, f, d)\} \tag{2.3.57}$$

and features are computed based on inner products or projections, then
according to (2.3.54) a small set of multiscale templates for classification can be
obtained based on linear combinations of Gabor wavelets according to the rows
of the matrix A. As Figure 2.9 shows, the resulting composite templates may
not be symmetric and may not resemble any known local basis. An alternative
way of applying the separability idea to redundant dictionaries is a greedy
algorithm similar to the matching pursuit proposed in [57] or a sequential
multi-scale hypothesis testing technique [26].

We call this method the discrimination or Separation Pursuit (SP) method.
Through a greedy sequential search algorithm similar to matching pursuit (MP),
it tries to find, suboptimally, the best decomposition for classification
purposes. The main difference between SP and MP is that SP uses a different
criterion, one that needs to be evaluated on a set of prelabeled training
functions F rather than on individual signals $f \in F$.

Let F be a matrix whose columns are training signals/vectors of all L classes.
The problem is to find, from among all possible decompositions of F over a
dictionary D of normalized waveforms/vectors $\{g_\gamma\}_{\gamma \in \Gamma}$,
a decomposition that results in projection coefficients with maximum
discriminatory power. As in MP, define $R^0 f = f$ as the initial residue of
the decomposition. Let $g_{\gamma_k} \in D$. As in matching pursuit, at the
k-th iteration,

$$\forall f \in F: \quad R^k f = \langle R^k f, g_{\gamma_k} \rangle\, g_{\gamma_k} + R^{k+1} f \tag{2.3.58}$$

The above equation can be rearranged and written in vector form as

$$R^{k+1} F = R^k F - g_{\gamma_k}\, R^k_{F\gamma_k}, \tag{2.3.59}$$

where $R^k F$ is the matrix of all residue vectors and

$$R^k_{F\gamma_k} = g_{\gamma_k}^T \cdot R^k F \tag{2.3.60}$$

is the vector of projection coefficients. The iterative information extraction is
performed by successive selection of the most discriminating dictionary element
for the decomposition residue at each step and computing the new residual term
according to (2.3.59). The most discriminatory element of the dictionary can be
selected using any of the separability measures described in Section 2.2:

$$\gamma_k = \arg\max_{\gamma \in \Gamma} J(R^k_{F\gamma}) \tag{2.3.61}$$

Fast numerical computation of the SP algorithm and its orthogonal version
parallels that of MP and can be implemented according to [57].
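
The SP iteration can be sketched in a few lines. This is only an illustration under stated assumptions: the separability measure J is taken to be a simple Fisher-type ratio (the measures of Section 2.2 are not reproduced here), the dictionary is a small explicit matrix, and the function names `fisher_ratio` and `separation_pursuit` are hypothetical.

```python
import numpy as np

def fisher_ratio(coeffs, labels):
    """Assumed measure J: between-class over within-class scatter of scalar coefficients."""
    grand_mean = coeffs.mean()
    between = 0.0
    within = 0.0
    for c in np.unique(labels):
        group = coeffs[labels == c]
        between += group.size * (group.mean() - grand_mean) ** 2
        within += ((group - group.mean()) ** 2).sum()
    return between / (within + 1e-12)   # small constant avoids division by zero

def separation_pursuit(F, labels, G, n_iter):
    """F: (d, N) training signals as columns; G: (d, K) unit-norm atoms as columns."""
    R = F.astype(float).copy()          # R^0 F = F
    chosen = []
    for _ in range(n_iter):
        coeffs = G.T @ R                # row gamma holds g_gamma^T R^k F, cf. (2.3.60)
        scores = [fisher_ratio(row, labels) for row in coeffs]
        k = int(np.argmax(scores))      # most discriminating atom, cf. (2.3.61)
        chosen.append(k)
        R = R - np.outer(G[:, k], coeffs[k])   # residue update, cf. (2.3.59)
    return chosen, R

# Toy data: two classes that differ only along coordinate 2 of a 4-D signal.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 20)
F = 0.1 * rng.standard_normal((4, 40))
F[2, labels == 1] += 5.0
G = np.eye(4)                           # trivial orthonormal dictionary for illustration
chosen, R = separation_pursuit(F, labels, G, n_iter=2)
```

In this toy case the first selected atom is the coordinate that separates the classes, and the residue along that atom vanishes after the update, so it is not selected again.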
Chapter 3

Multisource  Soft  Decision  Integration 

3.1  Introduction 

After extracting multiscale discriminant features we need to find an effective
framework for decision making. Effective here means that, in the process of
classification or recognition, the system takes advantage of all the relevant
information that is explicitly or implicitly embedded in the feature space. In
fact it has been argued and shown that an important factor that typically
degrades the classification and recognition performance of most systems is the
loss of information that results from under-utilization of information in the
feature space [86]. This fact and the principle of least commitment suggest
utilizing soft decisions as a more informative representation of intermediate
decisions, and carrying soft decisions along until a crisp decision is required.

Consider a general pattern classification/segmentation problem with L different
classes, based on m, possibly imprecise, sources with relative levels of
expertise denoted by $\alpha$'s. Let $\{\omega_i\}$ be a set of arranged/ordered observations in
time  or  space.  These  observations  may  be  obtained  from  sliding  windows  that 
span  the  signal  or  image.  For  example,  they  may  correspond  to  the  successive 
1-D  windows  used  for  speech  recognition,  or  to  the  2-D  windows  of  an  image 
segmentation  system. 

Consider a collection of N examples from L different, but known, classes.
Feature extraction is a mapping from signal space to feature space:

$$T : \omega_i \mapsto x_i \in X \tag{3.1.1}$$

so the training set $\mathcal{T}$ is a set of pairs

$$\mathcal{T}_s = \{(x_i, l_i) : i = 1, \ldots, N \text{ and } l_i \in [0,1]^L\} \quad \text{or}$$

$$\mathcal{T}_h = \{(x_i, l_i) : i = 1, \ldots, N \text{ and } l_i \in \{1, 2, \ldots, L\}\} \tag{3.1.2}$$

where $\mathcal{T}_s$ and $\mathcal{T}_h$ correspond to training based on soft
decisions or hard decisions, respectively. Equivalently, we let
$x_i^s = x^s(\omega_i)$ denote measurements, e.g. discriminant feature values or
vectors, obtained from a source s. The soft classifier is a map F(.), typically
non-linear, from the feature space X to the points in the “fuzzy” cube
$[0,1]^L$. Thus

$$F : X \to [0,1]^L \tag{3.1.3}$$

$$d^s(x_i) = d_i^s \tag{3.1.4}$$

$d_i^s = d^s(x_i)$ is a decision based on measurement $x_i$ from source s. In
general this decision is a vector of size L, whose j-th element shows the
decision (or in fuzzy terms, the fit value) associated with class j:

$$d^s(x_i) = [d^s(x_i, c_j)]_{j=1}^{L} \tag{3.1.5}$$

Thus, the classifier has L non-binary outputs, one for each class, where each
output takes values in [0,1] (Figure 3.1). Some authors put a constraint on the
summation of the soft decisions made for all classes:

$$\sum_{j=1}^{L} d^s(x_i, c_j) = 1 \tag{3.1.6}$$

This condition restricts the decision points to a hyperplane in the L-dimensional
decision space.

Now let $d_i = g(\{d_i^s, \alpha_s : s \in S\})$ be the decision based on the
consensus of all sources, each of which may be imprecise with reliability
$\alpha_s$. Decision integration is the problem of finding the best function g
that, through effective combination of the decisions obtained from the
individual sources, based on a consensus rule, achieves a more reliable result.

Figure 3.1: Soft decisions for L classes are vectors in an L-dimensional
space.

Based on the temporal or spatial arrangement and interrelationships of the
observations $\{\omega_i\}$, one can define a notion of neighborhood or context
area around each window of observation. Let $D_i = D(\omega_i)$ be the final
decision about event $\omega_i$, which is a function of the decisions obtained
from other windows in the context area of $\omega_i$, and possibly other
information Z, i.e. $D_i = h(\{d_j : j \in N_i\}, Z)$. Context-dependent
classification and recognition involves the definition of the function h based
on a reasonable assumption about the interrelation of observations within an
area/interval.

In this chapter we first discuss similarity-based soft/fuzzy classification
based on a single source. We then discuss consensus of experts through decision
integration, using objectively defined measures of the reliability or
importance of information sources. We incorporate spatial/temporal context
information by defining a relevance function that describes the
interrelationships among observations within a neighborhood.

3.2  Fuzzy  Partitioning  of  Feature  Space 

Let  X  =  {x}  be  a  universe  of  discourse  with  generic  elements  denoted  by  x. 
Membership  in  a  classical  set  A  of  X  is  often  viewed  as  a  characteristic  function 

$$\chi_A : X \to \{0, 1\} \quad \text{such that} \quad \chi_A(x) = 1 \Leftrightarrow x \in A \tag{3.2.7}$$

which  assumes  that  the  set  has  precisely  defined  boundaries  and  each  element 
(i.e.,  each  observed  example)  has  either  full  or  no  membership  in  set  A.  This 
assumption  results  in  hard  partitioning  of  the  feature  space,  and  as  we  will 
discuss  later,  there  is  a  loss  associated  with  such  partitioning. 

A fuzzy set B is, on the other hand, characterized by a function $f_B$ which
associates with each x a real number in [0, 1] that represents the “grade of
membership” of x in B. The closer the value of $f_B$ is to 1, the more x belongs
to set B. So, while in hard decision each observation is labeled as one of the
possible  classes,  soft  classification  attaches  to  each  observed  pattern  a  group  of 
membership  grades.  The  fuzzy  set  membership  functions  simply  but  efficiently 
encode  a  complete  ordering  among  the  set  elements.  Such  orderings  carry  a  lot 
of  information  about  the  relative  location  of  an  observation/measurement  in  the 
feature  space  with  respect  to  clusters  of  prelabeled  data.  They  are  also  useful 
for  discriminating  between  values  in  relation  with  a  variety  of  semantics  (e.g. 
preference,  uncertainty,  or  similarity)  that  a  fuzzy  set  based  representation  may 
bear  in  different  tasks. 

Figure 3.2: Hard partitioning (a) and soft partitioning (b) of the
feature space.

In this chapter our study of fuzzy memberships and soft decisions is mostly
related to grades of similarity and dissimilarity suggested by a group of experts
or classification resources. In this context the elements with membership 1 are
viewed as prototype elements of the fuzzy set, while other membership grades
estimate the closeness of the elements to the prototypes. An observation may
belong to one class to some extent and at the same time belong to another class
to another extent; membership grades are attached to indicate these extents
quantitatively. Such a partition is referred to as a fuzzy or soft partition of
the feature space. Formally, a fuzzy partition of a feature space is a family of
fuzzy sets $\{C_i,\ i = 1, \ldots, L\}$ on universe X such that


$$\forall x \in X: \quad 0 \le f_{C_i}(x) \le 1 \tag{3.2.8}$$

$$\forall x \in X: \quad \sum_{i=1}^{L} f_{C_i}(x) = 1 \tag{3.2.9}$$

(3.2.10)


In a multidimensional feature space the concept of fuzzy membership is
equivalent to soft/fuzzy partitioning of the space, where decision regions are
not separated by sharp hyperplanes; instead there are transition or fuzzy areas
between any two decision regions. Figure 3.2 schematically shows how a
soft/fuzzy partitioning of feature space may represent the memberships and
similarities more realistically than a hard decision. This allows classification
information to be utilized in subsequent analysis.


3.2.1  Learning  Membership  Functions 

After considering the effectiveness of soft decisions, one has to devise systematic
approaches to training the classifier to form the required soft decision
boundaries. Here we mention two major approaches to creating such membership
functions. The first method relies on probability measures of fuzzy events and
in particular on the so-called fuzzy mean and fuzzy variance of a fuzzy set [86]:

$$\mu_c^* = \frac{\sum_{i=1}^{N} f_c(x_i)\, x_i}{\sum_{i=1}^{N} f_c(x_i)} \tag{3.2.11}$$

$$\Sigma_c^* = \frac{\sum_{i=1}^{N} f_c(x_i)\, (x_i - \mu_c^*)(x_i - \mu_c^*)^T}{\sum_{i=1}^{N} f_c(x_i)} \tag{3.2.12}$$


Note  that  these  definitions  are  different  from  their  classical  counterparts  in  that 
each  example  contributes  to  the  mean  and  variance  of  a  class  based  on  its  partial 
membership  in  that  class. 
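
A membership-weighted mean and covariance of this kind can be sketched as follows. The function name `fuzzy_mean_cov` and the toy data are illustrative assumptions; the estimator simply weights each example by its membership grade and normalizes by the total membership.

```python
import numpy as np

def fuzzy_mean_cov(X, f):
    """X: (N, n) feature vectors; f: (N,) membership grades in [0, 1] for one class."""
    w = f / f.sum()                     # normalized memberships act as weights
    mu = w @ X                          # membership-weighted mean
    diff = X - mu
    cov = (w[:, None] * diff).T @ diff  # membership-weighted covariance
    return mu, cov

# Two 1-D examples: a full member at 0 and a half member at 4.
X = np.array([[0.0], [4.0]])
f = np.array([1.0, 0.5])
mu, cov = fuzzy_mean_cov(X, f)

# With crisp (0/1) memberships the estimates reduce to the classical ones.
X_crisp = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [10.0, 10.0]])
f_crisp = np.array([1.0, 1.0, 1.0, 0.0])
mu_crisp, cov_crisp = fuzzy_mean_cov(X_crisp, f_crisp)
```

The half member pulls the mean only half as strongly as a full member would, which is the sense in which each example contributes according to its partial membership.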

After  estimating  the  mean  and  variance  based  on  a  prelabeled  training  set, 
and  assuming  that  the  cluster  of  points  for  each  fuzzy  set  follows  a  normal 
distribution,  one  defines  a  Gaussian-shaped  membership  function  as  [86] 


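$$f_c(x) = \mathcal{N}(\mu_c^*, \Sigma_c^*) \tag{3.2.13}$$

$$= \frac{1}{(2\pi)^{n/2}\, |\Sigma_c^*|^{1/2}} \exp\left(-\tfrac{1}{2} (x - \mu_c^*)^T (\Sigma_c^*)^{-1} (x - \mu_c^*)\right) \tag{3.2.14}$$

Based on the same idea and using $\mu^*$ and $\Sigma^*$ one can define other types of
membership functions, e.g. triangular, exponential, or trapezoidal.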
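
A Gaussian-shaped membership function can be evaluated directly from an estimated mean and covariance. The sketch below assumes the standard multivariate normal density with the feature dimension taken from the mean vector; `gaussian_membership` is a hypothetical helper name.

```python
import numpy as np

def gaussian_membership(x, mu, cov):
    """Multivariate normal density at x for mean mu (n,) and covariance cov (n, n)."""
    n = mu.shape[0]
    diff = x - mu
    maha = diff @ np.linalg.solve(cov, diff)   # squared Mahalanobis distance
    norm = np.sqrt((2.0 * np.pi) ** n * np.linalg.det(cov))
    return float(np.exp(-0.5 * maha) / norm)

mu = np.zeros(2)
cov = np.eye(2)
at_center = gaussian_membership(np.zeros(2), mu, cov)
away = gaussian_membership(np.array([1.0, 1.0]), mu, cov)
```

As written, the value at the mean equals the normalizing constant rather than 1. Dividing by the value at the mean would give the prototype a grade of exactly 1; that rescaling is an assumption of this note and is not prescribed by the text.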

It has been argued that including some of the mixed classes in the training set
with their corresponding best mixed labels helps to estimate the mean and
variance better and therefore improves the final performance. Including such
fuzzy cases in the process of training is sometimes referred to as a fuzzy training
method. Obviously, reasonable fuzzy training requires a methodology for defining
membership values for the training set.

Figure 3.3: Each node in the hidden layer of an MLP network forms
a “soft hyperplane” in the feature space: (a) a basic connection in an
MLP, (b) the corresponding soft “hyperplane”.

An alternative approach to adaptively defining membership functions is to
use the nonlinear mapping characteristics of Multilayer Neural Networks (MLNN)
and supervised learning algorithms to learn multidimensional membership
functions based on a training set.

Consider a simple neural network with input layer X, connection weight
matrix W, and step function non-linearity for hidden and output nodes Z. As
shown in Figure 3.3, each hidden node i in the first layer represents a hyperplane
$W_i^T X$ in the space spanned by the input feature vectors. With sigmoidal
non-linearities at each node the hyperplane becomes a fuzzy hyperplane:

$$Z = \mathrm{Sigm}(W^T \cdot X) \quad \text{where} \tag{3.2.15}$$

$$\mathrm{Sigm}(y) = \frac{1}{1 + \exp(-y)} \tag{3.2.16}$$
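
The difference between a sharp and a soft hyperplane can be seen on a handful of points. This is a minimal sketch with an arbitrary weight vector `w` and sample points `X` chosen only for illustration; no bias term is used, in keeping with the form $W^T X$ above.

```python
import numpy as np

def sigm(y):
    """Logistic sigmoid, cf. (3.2.16)."""
    return 1.0 / (1.0 + np.exp(-y))

w = np.array([1.0, -1.0])               # hypothetical weights of one hidden node
X = np.array([[2.0, 0.0],               # far on the positive side
              [0.1, 0.0],               # just on the positive side
              [0.0, 0.0],               # on the hyperplane
              [0.0, 0.1],               # just on the negative side
              [0.0, 2.0]])              # far on the negative side

y = X @ w                               # signed position relative to the hyperplane
hard = (y >= 0).astype(float)           # step non-linearity: crisp side of the hyperplane
soft = sigm(y)                          # sigmoid: graded membership near the hyperplane
```

The step output jumps between 0 and 1 at the hyperplane, whereas the sigmoid output passes smoothly through 0.5 there, so points close to the boundary receive intermediate grades.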

In a two-layer network these hyperplanes can be combined to form any
set of convex soft decision boundaries (see Figure 3.4). With three layers,
even non-convex fuzzy decision regions can be formed. This process of creating
decision regions is very similar to linear programming ideas, except that it
involves a mild form of non-linearity.

Figure 3.4: Creating soft decision boundaries with neural networks:
(a) a two-layer neural network, (b) the corresponding decision region.

This neural network based approach provides a flexible framework for
implementing fuzzy training ideas. This type of training also requires the
input-output pairs for all training examples, including mixed/fuzzy examples
for which one needs to define a criterion for membership assignments. Note that
this membership assignment has to be defined mathematically and applied
consistently to the training sets. Including such examples in the training set
provides the network training algorithm with valuable information about the
slope of the membership function in the transition regions.

3.3  Multisource  Soft  Decision  Integration 

A number of different approaches have been proposed for analyzing information
obtained from several sources [52, 8, 34]. The simplest method is to form an
extended data/feature vector, containing information from all the sources, and
treat this vector as the vector output of a single source. Usually, in such
systems all similarities and distances are measured in the Euclidean sense. This
approach can be computationally expensive; it is successful only when all the
sources have similar statistical characteristics and comparable reliabilities. In
many applications this assumption is not valid and therefore a more intelligent
alternative approach has to be taken. In fact there is a research field called
Consensus Theory that deals with finding consensus among members of a
group of experts/sources and studies desired and undesired characteristics of
consensus rules [7, 46].

Consider a consensus rule $C_S$ for m data sources with probability measures
$\{p_1, p_2, \ldots, p_m\}$:

$$C_S : [P(\Omega, S)]^m \to P(\Omega, S) \tag{3.3.17}$$

where $P(\Omega, S)$ is the space of all probability measures with
$\sigma$-algebra S. There are several properties that are reasonable or
desirable for a consensus rule, for example:

•  Marginalization Property (MP)

$$C_S\bigl((p_1, p_2, \ldots, p_m) \mid T\bigr) = C_S(p_1 \mid T,\ p_2 \mid T,\ \ldots,\ p_m \mid T) \tag{3.3.18}$$

•  Null Set Property (NSP)

$$p_1(X) = p_2(X) = \cdots = p_m(X) = 0 \;\Rightarrow\; C_S(p_1, p_2, \ldots, p_m)(X) = 0 \tag{3.3.19}$$

•  Weak Setwise Function Property (WSFP)

$$C_S(p_1, p_2, \ldots, p_m)(X) = F(p_1(X), p_2(X), \ldots, p_m(X), X) \tag{3.3.20}$$

•  Strong Setwise Function Property (SSFP), also called strong label
neutrality or the context-free assumption

$$C_S(p_1, p_2, \ldots, p_m)(X) = G(p_1(X), p_2(X), \ldots, p_m(X)) \tag{3.3.21}$$

In  [7]  the  relationships  among  these  properties  are  studied  and  the  following 
theorem  is  proved: 

Theorem: Suppose there is a family of consensus rules $\{C_S\}$ in $\Omega$; then

1.  MP is equivalent to WSFP,

2.  (MP and NSP) is equivalent to SSFP,

3.  SSFP is achieved if and only if there exist non-negative numbers (weights)
$\alpha_1, \ldots, \alpha_m$ with $\sum_{i=1}^{m} \alpha_i = 1$ such that for all
$\sigma$-algebras S with $X \in S$ and all $p_i \in P(\Omega, S)$,

$$C_S(p_1, p_2, \ldots, p_m)(X) = \sum_{i=1}^{m} \alpha_i\, p_i(X) \tag{3.3.22}$$

This summation represents the so-called Linear Opinion Pool (LIOP), which
is one of the most commonly used consensus rules. This rule has a number of
advantages and disadvantages. It is simple, it yields a probability distribution,
and it has the MP and NSP properties. The weights $\{\alpha_i\}$ in this rule have
an intuitive interpretation as the relative importance or reliability of the
sources. It also has some disadvantages; for example, the LIOP is not externally
Bayesian, i.e. an LIOP-based decision maker does not necessarily satisfy Bayesian
rules. In order to avoid some of the shortcomings of LIOP, some authors have
discussed the application of the Logarithmic Opinion Pool (LGOP)

$$C_S(p_1, p_2, \ldots, p_m)(X) = \frac{\prod_{i=1}^{m} p_i(X)^{\alpha_i}}{\int \prod_{i=1}^{m} p_i(x)^{\alpha_i}\, d\mu} \tag{3.3.23}$$

where $\sum_{i=1}^{m} \alpha_i = 1$. It has been argued that the result of LGOP
is unimodal and less dispersed than that of LIOP. It is externally Bayesian, but
it assumes the independence of sources. It also has some disadvantages, e.g. it
considers “zero” opinions as vetoes, and it is computationally more complex than
LIOP. Because of the product form of this rule, the weighting factors in LGOP
have less intuitive interpretations.
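
The two pools are easy to compare on discrete opinions. The sketch below assumes each source reports a probability vector over L classes, so the normalizing integral of the LGOP becomes a sum; the function names and the example numbers are illustrative only.

```python
import numpy as np

def linear_pool(P, alpha):
    """LIOP: P is (m, L) with one distribution per source; alpha is (m,) and sums to 1."""
    return alpha @ P

def log_pool(P, alpha):
    """LGOP for discrete opinions: weighted geometric mean, renormalized to sum to 1."""
    prod = np.prod(P ** alpha[:, None], axis=0)
    return prod / prod.sum()

P = np.array([[0.7, 0.2, 0.1],          # opinion of source 1
              [0.5, 0.5, 0.0]])         # source 2 gives class 3 zero probability
alpha = np.array([0.6, 0.4])            # assumed reliabilities

liop = linear_pool(P, alpha)
lgop = log_pool(P, alpha)
```

With these numbers the linear pool still assigns class 3 a small probability, while the logarithmic pool assigns it exactly zero: a single zero opinion acts as a veto.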

One of the main problems with these consensus rules is the selection of
weights. The weights should represent an objective measure of the relative
importance and expertise of the sources.

In part of our study we will use LIOP and LGOP for decision integration.
Our sources are multiscale features and their reliabilities are their normalized
discrimination powers.

In our analysis our decisions are based on similarity and dissimilarity
measures rather than model-based probabilistic measures. In this context the
consensus rules are less restricted; for example, they do not have to provide
probability distributions and they need not satisfy Bayesian rules. For
clarity, in the remainder of this section we explain our methodologies using
a specific collection of feature sets as sources, with defined similarity and
reliability measures.

Following our projection-based feature extraction, each projection of the
input pattern onto a discriminant vector $u_i$ creates a resource for
classification information and therefore a decision axis with a certain level of
reliability and discriminatory power. The level of significance or reliability
$\alpha_i$ of the decisions based on $u_i$ is directly related to the class
separation along that axis, which is equal to the corresponding (normalized)
eigenvalue in the LDA:

$$\forall (\lambda_i, u_i) \in (\Lambda^{(m)} \times U^{(m)}): \quad \alpha_i = \frac{\lambda_i}{\sum_{j=1}^{m} \lambda_j} \tag{3.3.24}$$

For any test vectorized input pattern/image $\phi$ we project it onto each of
the top discriminant vectors u. Based on the distances between the resulting
coefficients $\phi(u)$ and those of the existing templates stored in the database,
we estimate the level of similarity of the input image to each known class (see
Figure 3.5):

$$\forall u \in U^{(m)}: \quad \phi(u) = \langle \phi, u \rangle \tag{3.3.25}$$

$$\forall c \in C: \quad \Delta_u(\phi, c) = |\phi(u) - c(u)| \tag{3.3.26}$$

$$\pi_u(\phi, c) = 1 - \frac{\Delta_u(\phi, c)}{\sum_{c_i \in C} \Delta_u(\phi, c_i)} \tag{3.3.27}$$

where $\pi_u(\phi, c)$ reflects the relative level of similarity between input
$\phi$ and class c according to source s = u, which has reliability $\alpha_u$.
Using our initial notation for soft decisions, we can put the
$\pi_u(\phi, c)$'s into a decision vector

$$d^u(\phi) = [\pi_u(\phi, c_j)]_{j=1}^{L} \tag{3.3.28}$$

Figure 3.5: The raw distances between each test example and the
known clusters along each discriminant axis result in the soft decision
along that axis.
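
The per-axis soft decisions can be illustrated on a toy problem. The sketch below rests on several assumptions: each class is represented by a single template, the similarity is one minus the distance normalized over all classes, and the final pooling step is a linear opinion pool with eigenvalue-based weights. All names and numbers are hypothetical.

```python
import numpy as np

def soft_decisions(phi, U, templates):
    """phi: (d,) input; U: (m, d) discriminant vectors as rows; templates: (L, d).
    Returns an (m, L) array whose row u holds the similarities of phi to each class."""
    phi_u = U @ phi                             # projections of the input
    c_u = U @ templates.T                       # projections of the class templates, (m, L)
    delta = np.abs(phi_u[:, None] - c_u)        # raw distances along each axis
    # Assumes the distances along an axis are not all zero.
    return 1.0 - delta / delta.sum(axis=1, keepdims=True)

eigvals = np.array([3.0, 1.0])                  # hypothetical LDA eigenvalues
alpha = eigvals / eigvals.sum()                 # normalized reliabilities of the axes

U = np.eye(2)                                   # two toy discriminant axes
templates = np.array([[0.0, 0.0],               # class 0
                      [4.0, 0.0],               # class 1
                      [0.0, 4.0]])              # class 2
phi = np.array([1.0, 0.0])                      # test input

pi = soft_decisions(phi, U, templates)          # one soft decision vector per axis
pooled = alpha @ pi                             # linear pooling of the axes (an assumption)
```

Here the first axis cannot distinguish class 0 from class 2 and the second cannot distinguish class 0 from class 1, but the pooled vector favors class 0.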

Having determined our decision axes and their reliabilities, we can apply a
probabilistic or an evidential scheme of multi-source data analysis to combine
the soft decisions made based on the individual imprecise sources to obtain a
more precise and reliable final result. The normalized similarity measures
($\pi$'s) indicate the proportions of evidence suggested by different sources.
They can be interpreted as the so-called basic masses of evidence or they can be
used as rough estimates of posterior probabilities given each measurement. From
this stage on, a probabilistic or an evidential reasoning approach can be taken
to combine the basic soft decisions. A comparative study of various probabilistic
and evidential reasoning schemes is given in [52].

Similarly, working with distances as dissimilarity measures, one can combine
basic soft decisions, and incorporate the reliability of each source, to define a
reasonable measure of distance in the feature space. Although the most
common measure used in the literature is Euclidean distance, as a more reasonable
measure we suggest a weighted mean absolute/square distance, with the weights
based on the discriminatory powers. In other words,


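$$\Delta_u(\phi, c) = |\phi(u) - c(u)| \tag{3.3.29}$$

$$D(\phi, c) = \sum_{u \in U^{(m)}} \alpha_u\, \Delta_u(\phi, c) \tag{3.3.30}$$

Therefore, for a given input $\phi$ the best match $c^\circ$ and its confidence
measure are

$$c^\circ = \arg\min_{c \in C} D(\phi, c) \tag{3.3.31}$$

$$\mathrm{conf}(\phi, c^\circ) = 1 - \frac{D(\phi, c^\circ)}{D(\phi, c')} \tag{3.3.32}$$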

where  c'  is  the  second  best  candidate.  In  this  framework,  incorporating  collateral 
information  or  prior  knowledge  and  expectations  from  context  becomes  very 
easy  and  logical.  All  we  need  to  do  is  to  consider  each  of  them  as  an  additional 
source  of  information  corresponding  to  a  decision  axis  with  a  certain  reliability 
and  include  it  in  the  decision  process. 
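
A reliability-weighted distance and a best-match rule can be sketched as follows. Everything specific here is an assumption made for illustration: the distance is the weighted mean absolute distance along the axes, the best match minimizes it, and the confidence is taken as one minus the ratio of the best distance to the second-best distance. The helper name `best_match` is hypothetical.

```python
import numpy as np

def best_match(phi, U, templates, alpha):
    """Returns (best class index, confidence, distances to all classes).
    Assumes at least two classes and a non-zero second-best distance."""
    delta = np.abs((U @ phi)[:, None] - U @ templates.T)   # (m, L) per-axis distances
    D = alpha @ delta                                       # reliability-weighted distance
    order = np.argsort(D)
    best, second = order[0], order[1]
    conf = 1.0 - D[best] / D[second]                        # assumed confidence measure
    return int(best), float(conf), D

alpha = np.array([0.75, 0.25])                  # hypothetical axis reliabilities
U = np.eye(2)
templates = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
phi = np.array([1.0, 0.0])

best, conf, D = best_match(phi, U, templates, alpha)
```

An extra source such as a prior or a context cue could be added as one more row of distances with its own weight, which is the sense in which collateral information enters as an additional decision axis.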

3.3.1  Incorporating  Spatial/Temporal  Context  Information 

Many signal/image processing tasks consist of local processing of data followed
by a combination of results obtained from the local windows. The windowing
approach is sometimes used because of hardware limitations when considering
large data sets or because of non-stationarity of the data. Local decisions made
over small windows are myopic and are not reliable on their own, so one needs
to devise methods of resolving the ambiguity and fuzziness of local decisions
in a consistent way. In many segmentation/recognition tasks, sliding windows
form a set of ordered observations about the signal/pattern. These sliding 1-D
or 2-D windows may partially overlap each other. Also, in a multiresolution
analysis there are sliding windows of variable sizes and scales that cover various
parts of the signal. Based on the common coverage area of the windows one can
define degrees of relevance and interrelationship among a set of observations in
a neighborhood.

In our approach soft “decision vectors” computed for each block are
integrated through weighted combination of decisions/votes obtained
independently from neighboring blocks. The alternative, viz. using large windows,
is not recommended because over larger windows signals are highly non-stationary
and the corresponding features result from averaging over heterogeneous
microstructures and are therefore less reliable. Large windows also provide less
spatial resolution, which is of great concern in segmentation of signals and
images.

Our decision integration scheme combines context information from various
sources based on their degrees of relevance R. For example, in terms of
temporal/spatial context we can write

$$\forall \omega \in \Omega: \quad D(\omega) = \sum_{\omega' \in N_\omega} R(\omega, \omega')\, D(\omega') \tag{3.3.33}$$

where $N_s$ is a neighborhood around the point s and D(s) represents the decision
vector at s. For temporal processing this degree of relevance may correspond to
the overlap of intervals covered by adjacent time windows. Likewise, in terms
of spatial context, assuming the information contained in each block about a
region is proportional to the area of their overlap, one can write

$$R(\omega, \omega') = \frac{A(W_\omega \cap W_{\omega'})}{A(W_\omega)} \tag{3.3.34}$$


where A(.) is a function representing the area of its argument. As the image is
analyzed by windows of size W in window shift steps of size w, the area contained
in each block centered at point $\omega$ is partially covered by neighboring blocks
$\{W_{\omega'} : \omega' \in N_\omega\}$ and contributes to their classifications.
In the case of a 2-D sliding window on an image, it can be shown that

$$R(\omega, \omega') = 1 - (|i| + |j|)\,\frac{w}{W} + |ij| \left(\frac{w}{W}\right)^2 \quad \text{for } -W/w < i, j < W/w$$

$$D(\omega) = D(\omega) + R(\omega, \omega') \times D(\omega') \tag{3.3.35}$$

where $(i, j) = \omega - \omega'$ is the offset between the two blocks in units of
the shift step w. Thus, after one complete scan of the image, the contributions
of all neighboring blocks are added, and a combined vote for each
macro-pixel of width w is obtained. Note that, following the principle of least
commitment, thus far we have expressed all “decisions” as real vectors and no
hard decision has been made.
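
The overlap-based relevance and the accumulation of neighboring votes can be sketched as follows. The sketch assumes square W x W windows on a regular grid with shift step w, one soft decision vector per block, and zero relevance for blocks that do not overlap; the function names are hypothetical.

```python
import numpy as np

def relevance(i, j, W, w):
    """Fractional overlap of two W x W windows whose centers differ by (i*w, j*w)."""
    if abs(i) * w >= W or abs(j) * w >= W:
        return 0.0                              # no overlap outside the stated range
    r = w / W
    return 1.0 - (abs(i) + abs(j)) * r + abs(i * j) * r ** 2

def integrate(decisions, W, w):
    """decisions: (rows, cols, L) soft decision vectors, one per block position.
    Returns the relevance-weighted sum of decisions over each block's neighborhood."""
    rows, cols, _ = decisions.shape
    reach = int(np.ceil(W / w)) - 1             # largest offset that still overlaps
    out = np.zeros_like(decisions, dtype=float)
    for a in range(rows):
        for b in range(cols):
            for i in range(-reach, reach + 1):
                for j in range(-reach, reach + 1):
                    p, q = a + i, b + j
                    if 0 <= p < rows and 0 <= q < cols:
                        out[a, b] += relevance(i, j, W, w) * decisions[p, q]
    return out

# Toy scan: a 3 x 3 grid of blocks that all vote fully for class 0 (L = 2).
decisions = np.zeros((3, 3, 2))
decisions[..., 0] = 1.0
combined = integrate(decisions, W=8, w=4)
```

The result is still a real-valued vector per block; a hard label, if needed, would be taken only afterwards, for example with an argmax over the last axis.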

Multi-resolution analysis of data (images) combines the results of
classifications obtained at several scales. Classification is typically done from
coarse to fine. We start with the low-resolution data to perform classification
and use higher-resolution data when the confidence level obtained is not
satisfactory. The combination of decisions can be performed based on our
assumption about the spatial relevance function, using the fact that the windows
on the low-resolution signal are actually projections of larger areas on the
high-resolution view. In other words, one can combine decisions obtained at
different scales based on their discrimination power and relevance to each block.
The final majority votes and their confidence measures are based on the
accumulation of soft decisions within and across scales and the closeness of the
best class candidates.

The  combination  of  weighted  soft  decisions  is  less  susceptible  to  error  than 
is  each  individual  local  vote.  Note  that  the  idea  of  soft  decision  propagation 
and  integration  within  and  across  scales  is  dual  to  the  lateral  inhibition  between 
decision  units  involved  in  one  or  several  scales.  The  role  of  decision  propagation 
profiles  is  similar,  but  not  identical,  to  the  role  of  inhibition  profiles;  one  is 
democratic  while  the  other  is  competitive. 
Chapter 4

Signal  and  Image  Classification 

4.1  Introduction 

During the last three decades, there have been many studies of classification and
segmentation of signals and images. A variety of descriptors based on
statistical, structural and spectral properties of one-dimensional or
multidimensional signals are utilized to form the best sets of discriminant
features. Parametric methods based on hidden Markov models [42] and Markov random
fields [16], time/spatial domain approaches based on higher order moments [62],
co-occurrence and correlation matrices [17], and frequency domain/filtering
methods [38, 44] are among the major suggested schemes.

Also, different families of multiscale decompositions including WT, WP and
Gabor filtering have been successfully applied to various classification and
recognition tasks [13, 51]. Most of the proposed multiscale approaches to
classification problems are based on decompositions that are either independent
of signal characteristics or based on an energy or representation criterion.

Based on our analytical results in Chapter 2, our objective in this chapter
is to show how adaptive discrimination-based WP features can be used to
design efficient and yet simple signal and image classification systems with very
low-dimensional feature vectors. To show the effectiveness of our ideas for
real signal and image classification and segmentation tasks, we will apply them
to Automatic Target Recognition (ATR) and texture segmentation tasks. In
these tests a set of real-aperture radar returns are used as examples of 1-D
signals and a set of standard textures are used as a framework for 2-D image
classification/segmentation.

In  the  tests  described  in  this  and  the  following  chapters  we  have  used  different 
QMF  filters  given  in  [2];  the  results  show  that  the  choice  of  filters  may  have 
minor  effects  on  the  intermediate  results,  but  plays  an  insignificant  role  in  the 
final  performance. 

4.2  Classification  of  Radar  Signatures 

To show the effectiveness of the suggested feature extraction process in the
discrimination of one-dimensional signals, we applied it to the classification of
radar target signatures using the database provided as part of the ARPA University
ATR initiative. In this section, without going into details about the theory of
radar signatures, we use them as a framework for testing our scheme.

Millimeter  Wave  (MMW)  Real-Aperture  Radar  (RAR)  signatures  play  an 
important  role  in  automatic  target  recognition.  Due  to  their  high  range  reso¬ 
lution,  RAR  signals  can  resolve  tactical  targets  at  ranges  of  several  kilometers. 
On the other hand, MMW radar range profiles are very noisy and their dominant
peaks are sensitive to clutter and small changes in aspect angle. Therefore RAR
contains  valuable  information  which  is  difficult  to  extract.  It  has  been  argued 
that  some  of  the  difficulties  may  be  overcome  by  using  multiscale  features. 

The  radar  is  transmitted  in  Right  Hand  Circular  Polarization  (RHCP)  and 
received  both  in  RHCP  and  LHCP,  so  each  RAR  return  consists  of  two  images, 
right-right  (even)  and  right-left  (odd)  polarizations.  The  RAR  data  consists  of 
FFT  magnitude  range  profiles  for  each  of  five  different  targets.  There  are  a  total 
of  128  range  profiles  each  with  128  resolution  cells/samples.  The  targets  are  a 
T-72  Tank,  a  ZIL  truck,  an  ASTRO  multiple  missile  launcher,  a  TZM,  and  a 
BTR60  armored  personnel  carrier. 


Figure  4.1:  Example  of  radar  target  signatures  for  five  different 
classes  of  targets. 

There  are  four  different  views  of  each  target:  nose  (0°),  right  side  (90°),  tail 
(180°),  and  left  side  (270°)  views.  80  radar  returns  from  five  different  stationary 
targets  were  used.  For  each  target  there  are  two  radar  returns  for  each  of  the 
two  polarizations  and  the  four  view  angles.  Since  the  targets  are  assumed  to  be 
stationary,  and  in  order  to  reduce  noise,  the  average  of  every  32  channels  was 
used.  The  data  is  divided  into  training  and  test  sets.  Figure  4.1  shows  examples 
of  such  averaged  signatures  used  in  the  classification  test. 

In  these  tests  the  idea  of  dimensionality  reduction  is  applied  to  a  two-level 
balanced  wavelet  packet  tree.  For  each  subband/node,  second  and  third  central 
moments are computed and $\{\mu_2, \mu_3/\mu_2\}$ is used as a feature vector. Figure 4.2
illustrates  the  separated  clusters  for  five  classes  of  radar  targets  where  only  the 
two  most  important  features  are  used.  All  16  radar  signatures  corresponding  to 
one  target  are  considered  to  be  in  one  class.  As  Figure  4.2  shows,  classification 

Figure 4.2: Clusters of feature points corresponding to five different
classes  of  targets  separated  in  the  selected  2-D  feature  spaces:  best 
two  features  (top),  second  best  two  features  (bottom). 

can  be  performed  easily  even  with  linear  classifiers,  and  the  distance  between 
clusters  allows  us  to  achieve  good  classification  results  even  in  the  presence  of 
small  Gaussian  noise.  For  more  details  about  this  dataset  see  [27]. 
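The feature computation described above can be sketched as follows. This is only an illustration under stated assumptions: a Haar filter pair stands in for the QMF filters of [2], only the four leaf subbands of the two-level balanced tree are used, and the function names (`haar_split`, `balanced_wp_leaves`, `moment_features`, `wp_feature_vector`) are hypothetical. The input is a random stand-in for a 128-sample range profile, not data from the ARPA database.

```python
import numpy as np

def haar_split(x):
    """One Haar analysis step: (lowpass, highpass), each half the input length."""
    x = np.asarray(x, dtype=float)
    even, odd = x[0::2], x[1::2]
    return (even + odd) / np.sqrt(2.0), (even - odd) / np.sqrt(2.0)

def balanced_wp_leaves(x, levels=2):
    """Split every node at every level, giving 2**levels leaf subbands."""
    nodes = [np.asarray(x, dtype=float)]
    for _ in range(levels):
        nodes = [band for node in nodes for band in haar_split(node)]
    return nodes

def moment_features(band):
    """Return (mu2, mu3 / mu2) for one subband."""
    centered = band - band.mean()
    mu2 = np.mean(centered ** 2)
    mu3 = np.mean(centered ** 3)
    return mu2, (mu3 / mu2 if mu2 > 0 else 0.0)

def wp_feature_vector(x, levels=2):
    """Concatenate the moment features of all leaf subbands."""
    feats = []
    for band in balanced_wp_leaves(x, levels):
        feats.extend(moment_features(band))
    return np.array(feats)

# Assumption: a random signal standing in for an averaged range profile.
profile = np.random.default_rng(0).normal(size=128)
features = wp_feature_vector(profile)  # 4 subbands x 2 moments = 8 values
```

The dimensionality reduction step would then map this candidate vector to the two most discriminant features; that step is not shown here.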

In this test a simple neural network is used as a “soft classifier”. The network
has  two  input,  three  hidden  and  five  output  units  for  five  classes  of  targets. 
Results  show  about  1%  error  on  the  training  set  and  about  2%  on  the  test  set. 
The  confusion  matrix  is  shown  in  Table  4.1.  The  training  and  test  sets  were 
similar  because  all  targets  were  stationary  and  there  were  small  changes  across 
channels. This example shows how one can design a very simple and efficient
classification  system  for  a  specific  task. 

Targets   T-72   ZIL   ASTRO   TZM   BTR60
T-72        20     0       0     0       0
ZIL          0    19       0     1       0
ASTRO        0     0      20     0       0
TZM          0     1       0    19       0
BTR60        0     0       0     0      20

Table 4.1: Confusion matrix in the radar signature classification test.
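As a quick check, the overall error rate implied by Table 4.1 can be recomputed from its entries. The variable names below are illustrative; because the matrix is symmetric, the result does not depend on whether rows denote true or assigned classes.

```python
import numpy as np

# Entries of Table 4.1 (order: T-72, ZIL, ASTRO, TZM, BTR60).
confusion = np.array([
    [20,  0,  0,  0,  0],
    [ 0, 19,  0,  1,  0],
    [ 0,  0, 20,  0,  0],
    [ 0,  1,  0, 19,  0],
    [ 0,  0,  0,  0, 20],
])

accuracy = np.trace(confusion) / confusion.sum()        # correct / total
error_rate = 1.0 - accuracy                              # misclassified fraction
per_class = np.diag(confusion) / confusion.sum(axis=1)   # fraction correct per class
```

The 2 off-diagonal entries out of 100 give an error rate of 2%, in agreement with the figure quoted for the test set.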

Figure 4.3: Some of the textures used in the classification experiments.

4.3  Texture  Classification 

The  effectiveness  of  the  suggested  basis  selection  is  further  illustrated  by  ap¬ 
plying  it  to  image  texture  classification  tasks.  The  input  data  consists  of  ten 
textured  images  shown  in  Figure  4.3.  Feature  vectors  are  computed  from  the 
second and third central moments ($\mu_2$ and $\mu_3/\mu_2$, respectively) of the image
subbands. Each of the training and test sets consists of about 100 image samples
of each texture, selected randomly from 512 × 512 texture images. Each texture
sample is a 64 × 64 pixel image. Figure 4.4 shows the class separation

Figure 4.4: Decomposition results: selected subbands and computed
class  separabilities  (left);  increase  in  CCS  with  the  number  of  features 
(right). 

obtained  at  each  level  of  the  selected  WP  decomposition.  The  figure  also  shows 
the improvement obtained by using both $\mu_2$ and $\mu_3/\mu_2$. The significant effect
of using these features on classification performance also suggests that tree
selection  should  not  be  based  only  on  local  energies  (or  second  moments). 

In  Figure  4.5  some  of  the  classification  results  for  the  ten  textures  in  Figure 
4.3  are  given.  Also  their  corresponding  clusters  in  the  best  3-D  feature  space 
based  on  the  suggested  dimensionality  reduction  idea  are  shown.  Classification 
results  are  obtained  based  on  class  separation  analysis  and  the  suggested  algo¬ 
rithm.  The  four  most  important  features  are  selected.  A  simple  feed-forward 
neural  network  [71,  49]  with  four  input,  eight  hidden,  and  ten  output  units  is 
used  for  classification.  In  some  stages  of  building  the  wavelet  packet  tree  (Fig¬ 
ure  4.4),  energy  and  separation  based  criteria  suggest  different  strategies  for 
extending  the  decomposition. 
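The idea of keeping only the few most discriminant features can be illustrated with a per-feature Fisher ratio (between-class over within-class scatter). This is a simplified stand-in, not the class separability criterion or linear map of Chapter 2: it ranks features independently and ignores correlations between them. The function names and the synthetic data are assumptions made for the example.

```python
import numpy as np

def fisher_scores(X, y):
    """Per-feature ratio of between-class to within-class scatter."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        class_mean = Xc.mean(axis=0)
        between += len(Xc) * (class_mean - overall_mean) ** 2
        within += ((Xc - class_mean) ** 2).sum(axis=0)
    return between / np.maximum(within, 1e-12)

def select_top_features(X, y, k=4):
    """Indices of the k features with the largest Fisher score."""
    return np.argsort(fisher_scores(X, y))[::-1][:k]

# Synthetic data: 3 classes, 6 candidate features, only features 1 and 4
# carry class information.
rng = np.random.default_rng(1)
y = np.repeat([0, 1, 2], 50)
X = rng.normal(size=(150, 6))
X[:, 1] += 5.0 * y
X[:, 4] += 3.0 * y
top = select_top_features(X, y, k=2)
```

On this synthetic example the two informative features are recovered; an energy-based ranking would have no reason to prefer them.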

Figure 4.5: Some of the classification results (left); clusters in the
selected  3-D  feature  space  (right). 

Finally,  some  tests  on  90°  and  180°  rotated  input  textures  are  performed. 
The  classification  errors  increase  by  between  1%  and  14%,  depending  on  the 
textures.  This  is  partly  due  to  the  separability  of  the  filters  and  partly  because 
of  the  basis  selection  algorithm.  The  algorithm  by  its  nature  looks  for  common 
features  among  all  examples  of  the  same  class  as  well  as  features  that  discrim¬ 
inate  examples  in  different  classes.  So  if  all  examples  of  a  directional  texture 
are  selected  from  the  same  image,  it  is  expected  that  the  algorithm  will  pick  up 
some  directionally  sensitive  features.  In  general,  depending  on  the  task,  differ¬ 
ent  rotated  versions  of  a  directional  texture  may  or  may  not  be  “defined”  as  the 
same  texture,  and  this  has  to  be  considered  in  the  class  separation  analysis.  To 
test  this  idea,  for  each  texture  we  included  some  rotated  examples  defined  as 
being  in  the  same  class,  and  we  applied  the  same  feature  selection  algorithm. 
Although  the  rotated  examples  were  not  included  in  the  training  of  the  classi¬ 
fier  network,  the  resulting  classification  performance  on  rotated  input  textures 
improved  significantly,  e.g.  from  about  96%  to  98%  for  64  x  64  windows. 

Despite  the  simplicity  of  the  system,  the  results  are  comparable  to  other 
recently published texture classification schemes [13]. Note that in this approach
the selection of basis/features is performed once for all intended classes, whereas
in [13] the decision about the tree structure has to be made on-line and separately
for every input example. Also, since our suggested basis selection is based on
observations over a group of examples for each class, the resulting tree structure
is less susceptible to errors in the noisy examples.

In  order  to  test  the  effect  of  windowing  on  class  separability,  we  tested  four 
different  window  sizes,  as  shown  in  Figure  4.5.  As  expected,  whenever  we  reduce 
the  window  size  for  better  localization,  we  lose  class  separation,  which  results 
in  less  accurate  or  less  certain  local  decisions.  Thus  the  need  for  soft  local 
classification  and  context-dependent  decisions  is  apparent. 

4.4  Texture  and  Image  Segmentation 

For the texture segmentation tests the features are based on segmentation
windows, i.e. the central moments are computed over small windows on the
decomposed image. Because of the down-sampling involved in the transform, the
corresponding window sizes for the sub-bands at the $l$-th level of the tree are
$W/2^l$, where $W$ is the input window size. Therefore the depth of the tree is
limited by the size of the input window and the nature of the signals to be
classified. Also, the order of the filters in
filter  bank  implementations  should  be  smaller  than  the  window  size  to  avoid  the 
dominance  of  window  boundary  effects  on  the  resulting  feature  computation. 
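The window-size constraint can be made concrete with a small calculation. The sketch below assumes that the filter length must stay strictly below the subband window size at every level; the text does not say at which level the comparison is made, so this reading and the function names are assumptions.

```python
def subband_window_sizes(window, max_level):
    """Window size W / 2**l seen at levels l = 0 .. max_level."""
    return [window // (2 ** level) for level in range(max_level + 1)]

def max_tree_depth(window, filter_length):
    """Deepest level whose subband window is still larger than the filter."""
    depth = 0
    while window // (2 ** (depth + 1)) > filter_length:
        depth += 1
    return depth

sizes_64 = subband_window_sizes(64, 3)   # [64, 32, 16, 8]
depth_64 = max_tree_depth(64, 4)         # windows 32, 16, 8 exceed length 4
depth_16 = max_tree_depth(16, 4)         # only the level-1 window (8) does
```

Under this reading, a 16 × 16 segmentation window with a length-4 filter allows only a single decomposition level, which illustrates why small windows restrict the usable tree depth.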

Figure  4.6  shows  the  segmentation  results  for  three  visually  similar  textures. 
In  the  test  a  window  size  of  16  x  16  pixels,  with  8-pixel  overlap,  is  chosen  and 
decision  integration  is  used.  In  this  test  we  used  a  simple  two-layer  neural 
network  with  just  three  input,  four  hidden,  and  three  output  units  to  build  our 
soft  classifier.  As  this  figure  shows,  results  comparable  to  those  of  other  texture 

Figure 4.6: Example of a texture segmentation using a reduced two-dimensional
feature space: (left) original image, (right) segmentation result.


Figure  4.7:  The  clusters  corresponding  to  three  textures  in  the  seg¬ 
mented  image,  based  on  the  separability  criterion  (top)  and  based 
on  an  energy  criterion  (bottom). 

segmentation  schemes,  including  wavelet-based  systems  [38,  55],  are  obtained, 
using  a  generic  scheme  of  low  complexity  and  with  a  small  number  of  features. 
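The windowing used in this test (16 × 16 windows with 8-pixel overlap) can be sketched as below. The generator name `sliding_windows` and the 64 × 64 placeholder image are assumptions; the sketch only enumerates windows and counts how many of them cover each pixel, which is the number of local soft decisions available for integration at that pixel.

```python
import numpy as np

def sliding_windows(image, size=16, overlap=8):
    """Yield ((row, col), window) for square windows with the given overlap."""
    step = size - overlap
    rows, cols = image.shape
    for r in range(0, rows - size + 1, step):
        for c in range(0, cols - size + 1, step):
            yield (r, c), image[r:r + size, c:c + size]

image = np.zeros((64, 64))                  # placeholder image
coverage = np.zeros_like(image, dtype=int)  # windows covering each pixel
positions = []
for (r, c), window in sliding_windows(image):
    positions.append((r, c))
    coverage[r:r + 16, c:c + 16] += 1
```

With half-window overlap, every interior pixel is covered by four windows, while corner pixels are covered by only one.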

Figure  4.7  compares  the  cluster  separations  in  the  feature  space  when  the 
best  feature  vectors  are  selected  according  to  the  suggested  class  separability 
based  linear  map  to  those  obtained  using  dominant  energy  based  approaches. 
As  this  example  illustrates,  with  the  same  feature  size  the  suggested  method 
provides  a  very  good  separation  of  classes. 


Chapter  5 

Layout-Independent  Document  Page  Segmentation 

5.1  Introduction 

Recent  advances  in  information  and  communications  technologies  have  increased 
the  need  for,  and  therefore  the  interest  in,  automated  processing  of  documents. 
Efficient  storage  and  transmission  of  documents  as  well  as  archiving  and  infor¬ 
mation  retrieval  for  document  databases  and  “digital  libraries”  have  become 
important  research  issues. 

Two  important  tasks  of  most  document  processing  systems  are  page  decom¬ 
position  and  optical  character  recognition  (OCR).  For  coding  or  understanding 
a  document  it  is  essential  to  identify  text,  image  and  graphics  regions,  as  a 
physical  segmentation  of  the  page,  in  order  to  be  able  to  process  it  appropri¬ 
ately.  For  example,  one  must  identify  the  text  regions  before  applying  OCR 
algorithms,  and  identify  graphics  regions  before  attempting  to  interpret  or  vec¬ 
torize  them.  Physical  page  segmentation  may  also  be  required  for  the  task  of 
functional  layout  analysis,  to  identify  the  document’s  type  (e.g.  journal,  memo, 
check,  etc.)  or  to  generate  hypotheses  as  to  the  components’  roles  and  logical 
functions  (title,  abstract,  footnote,  caption,  signature,  table,  etc.).  As  part  of  a 
source  compression  scheme  one  may  consider  a  document  image  as  a  composite 
source,  decompose  it  into  text,  image  and  graphics  sub-sources  where  each  sub¬ 
source  has  more  “homogeneous”  outputs,  and  design  separate  coding  schemes 
for  each  sub-source,  based  on  appropriate  fidelity  criteria  [14];  see  Figure  5.1. 

Page  segmentation  and  layout  analysis  methods  described  in  the  literature 
make  use  of  well-known  image  processing  tools  which  can  be  broadly  classified 

Figure 5.1: The role of page segmentation in document image processing.


as  bottom-up  and  top-down  [81].  Bottom-up  tools,  such  as  connected  compo¬ 
nent  analysis  [32],  start  from  the  pixel  level  and  merge  regions  together  into 
larger  and  larger  components  (e.g.  first  characters,  then  words,  text  lines,  para¬ 
graphs,  etc.).  Top-down  techniques  apply  a  priori  knowledge  about  the  page  to 
hypothesize  and  split  the  page  into  blocks  which  are  subsequently  identified  and 
subdivided  further.  For  example,  one  may  first  locate  major  columns  and  then 
split  them  further  into  paragraphs,  text  lines,  and  eventually  words.  Examples 
of  algorithms  which  use  a  top-down  approach  include  recursive  projection  profile 
cuts  [87,  84],  run  length  smoothing  and  constrained  run  length  [85].  In  general, 
most  approaches  use  a  combination  (or  hybrid)  of  top-down  and  bottom-up 
techniques. 

One  method,  described  in  [84,  50],  uses  projection  profiles  and  an  X-Y  tree 
representation  of  documents  to  exploit  the  fact  that  the  components  of  printed 
pages  (e.g.  text  blocks,  tables,  figures)  can  often  be  bounded  by  rectangular 
blocks.  The  root  of  the  tree  is  the  entire  page  and  after  iterated  subdivision, 
based  on  changes  in  the  projection  profiles,  each  rectangular  block  in  the  page 
is  represented  by  a  node  in  the  tree.  This  results  in  a  hierarchical  block  segmen¬ 
tation  of  the  page. 
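A simplified version of such a recursive projection-profile subdivision is sketched below. It is not the algorithm of [84, 50]: it cuts at the widest interior run of empty rows or columns, returns only the leaf blocks rather than the full tree, and the names `widest_gap`, `xy_cut` and the `min_gap` parameter are assumptions made for the illustration (1 = black pixel).

```python
import numpy as np

def widest_gap(profile, min_gap):
    """Return (start, end) of the widest interior run of zeros, or None."""
    best = None
    j, n = 0, len(profile)
    while j < n:
        if profile[j] == 0:
            start = j
            while j < n and profile[j] == 0:
                j += 1
            interior = start > 0 and j < n
            if interior and j - start >= min_gap:
                if best is None or j - start > best[1] - best[0]:
                    best = (start, j)
        else:
            j += 1
    return best

def xy_cut(page, top=0, left=0, min_gap=2):
    """Return leaf blocks (top, left, bottom, right) of a simple X-Y subdivision."""
    rows = np.flatnonzero(page.any(axis=1))
    cols = np.flatnonzero(page.any(axis=0))
    if rows.size == 0:
        return []
    # Trim surrounding white space before looking for cuts.
    r0, r1 = int(rows[0]), int(rows[-1]) + 1
    c0, c1 = int(cols[0]), int(cols[-1]) + 1
    block = page[r0:r1, c0:c1]
    top, left = top + r0, left + c0
    row_gap = widest_gap(block.sum(axis=1), min_gap)
    col_gap = widest_gap(block.sum(axis=0), min_gap)
    if row_gap is None and col_gap is None:
        return [(top, left, top + block.shape[0], left + block.shape[1])]
    cut_rows = col_gap is None or (
        row_gap is not None
        and row_gap[1] - row_gap[0] >= col_gap[1] - col_gap[0])
    if cut_rows:
        a, b = row_gap
        return (xy_cut(block[:a], top, left, min_gap)
                + xy_cut(block[b:], top + b, left, min_gap))
    a, b = col_gap
    return (xy_cut(block[:, :a], top, left, min_gap)
            + xy_cut(block[:, b:], top, left + b, min_gap))

page = np.zeros((10, 10), dtype=np.uint8)
page[0:4, :] = 1        # a full-width block
page[6:10, 0:4] = 1     # lower-left block
page[6:10, 6:10] = 1    # lower-right block
blocks = xy_cut(page)
```

On this toy page the first cut is horizontal and the lower half is then split vertically, giving three rectangular leaf blocks. A skewed page would leave no empty rows or columns, which is the failure mode discussed later in this section.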

The constrained run length algorithm starts from the binary image and replaces
every string of contiguous 0’s (corresponding to white pixels) of length
less than a predetermined constant by a string of 1’s (i.e. black pixels) of the
same  length  [14,  85].  This  binary  smearing  process  is  performed  in  both  hor¬ 
izontal  and  vertical  directions.  The  final  bit  map  is  obtained  from  the  logical 
“AND”  of  the  two  outputs.  The  vertical  and  horizontal  constraint  lengths  are 
determined  from  anticipated  inter-component  spacing.  Clearly  these  methods 
are  dependent  on  assumptions  about  component  sizes,  component  proximity, 
and  page  orientation.  A  survey  of  the  most  common  techniques  is  contained  in 
[81]. 
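The smearing step described above can be sketched as follows. This is a minimal illustration rather than the implementation of [14, 85]: short white runs touching the page border are also filled, and the function names and gap parameters are assumptions (1 = black, 0 = white).

```python
import numpy as np

def smear_rows(binary, max_gap):
    """Fill every run of 0s shorter than max_gap with 1s along each row."""
    out = binary.copy()
    n_cols = binary.shape[1]
    for i, row in enumerate(binary):
        j = 0
        while j < n_cols:
            if row[j] == 0:
                start = j
                while j < n_cols and row[j] == 0:
                    j += 1
                if j - start < max_gap:
                    out[i, start:j] = 1
            else:
                j += 1
    return out

def constrained_run_length(binary, h_gap, v_gap):
    """Smear horizontally and vertically, then AND the two bit maps."""
    binary = np.asarray(binary, dtype=np.uint8)
    horizontal = smear_rows(binary, h_gap)
    vertical = smear_rows(binary.T, v_gap).T
    return horizontal & vertical

row_demo = smear_rows(
    np.array([[1, 0, 0, 1, 0, 0, 0, 0, 1]], dtype=np.uint8), 3)
page_demo = constrained_run_length(
    [[1, 0, 1], [0, 0, 0], [1, 0, 1]], h_gap=2, v_gap=2)
```

In `row_demo` the gap of two white pixels is filled and the gap of four is kept. In `page_demo` the center pixel stays white because neither direction smears it, which shows the effect of the final AND.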

Recently,  a  more  flexible  method  of  page  segmentation  based  on  analysis  of 
background  white  space  has  been  explored  by  several  authors  [3,  64,  73].  The 
scheme  is  based  on  tracking  major  white  spaces  between  printed  components  to 
identify  region  boundaries.  This  method  is  based  on  relatively  few  assumptions 
and  provides  good  results  even  for  skewed  pages  or  documents  with  complex 
layouts. For identification of component type, some approaches use simple
statistical tests to classify detected major blocks as text or non-text regions [84].
Black pixel density, black/white ratio or transitions, average vertical or
horizontal run lengths, and row-by-row cross-correlations [65] are some of the features
used in these post-classification stages.
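A few of these block statistics are easy to compute from a binary block, as the hedged sketch below shows. The function name and the choice of exactly these three statistics are assumptions; the thresholds that the cited systems apply to them are not reproduced here (1 = black, 0 = white).

```python
import numpy as np

def block_statistics(block):
    """Black pixel density, horizontal transitions, mean horizontal black run."""
    block = np.asarray(block, dtype=np.int8)
    density = block.mean()
    # Differences between horizontal neighbors: nonzero where the color changes.
    steps = np.diff(block, axis=1)
    transitions = int(np.count_nonzero(steps))
    # A black run starts at column 0 if that pixel is black, or at a 0 -> 1 step.
    n_runs = int(block[:, 0].sum() + np.count_nonzero(steps == 1))
    mean_run = block.sum() / n_runs if n_runs else 0.0
    return density, transitions, mean_run

density, transitions, mean_run = block_statistics([[1, 1, 0, 1],
                                                   [0, 0, 0, 0]])
```

For the toy block, three of eight pixels are black, the first row contains two transitions, and its two black runs have lengths 2 and 1.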

Each  of  the  above  techniques  relies  to  a  different  extent  on  prior  knowledge 
about  the  generic  document  layout  structure,  such  as  rectangularity  of  major 
blocks,  consistency  in  horizontal  and  vertical  spacing,  and  independence  of  text, 
graphic  and  image  blocks,  and/or  assumptions  about  textual  and  graphical  at¬ 
tributes  such  as  font  size  and  text  line  orientation.  Utilizing  knowledge  about 
the  layout  and  structure  of  documents  results  in  simple,  elegant  and  efficient 
page decomposition systems but also limits the range of applicability of the
algorithms. For example, methods based on projection profiles fail if the page
layout  is  complex,  the  page  is  skewed,  or  text  strings  on  the  page  have  different 
orientations.  There  are  methods  of  estimating  and  correcting  the  skew  angle 
[61,  4],  but  they  have  limited  ranges  and  add  to  the  complexity  of  the  system. 
Methods  based  on  smearing  or  white  spaces  are  sensitive  to  character  sizes  as 
well  as  line  and  character  spacing.  They  may  also  fail  when  text  regions  touch 
images  or  are  embedded  in  them. 

In  some  applications  it  is  desirable  to  have  segmentation  methods  that  do  not 
assume  a  priori  knowledge  about  the  content  and  attributes  of  text,  or  about  the 
boundaries  of  major  blocks.  Such  approaches  should  be  robust  to  skew,  noise 
and  other  degradation.  Some  of  the  difficulties,  shown  in  Figure  5.2,  which  are 
common  in  general  classes  of  documents,  and  which  make  these  goals  hard  to 
attain, include:

•  Noise  and  degradation  caused  by  copying,  scanning,  transmission  or  aging. 

•  Page  skew  and  text  lines  with  different  orientations  on  the  same  page. 

•  Text  touching  or  overlapping  with  image  and  graphics  components. 

•  Combinations  of  varying  text  and  background  gray  levels  (e.g.  inverted 
text). 

•  Complex  and  irregular  layout  structures  that  are  common  especially  in 
non-technical  documents.  Document  objects  may  not  have  rectangular  or 
even  convex  boundaries  and  may  be  embedded  in  one  another. 

•  Curved  lines  or  multi-column  pages  where  text  lines  in  the  two  columns 
are  not  of  the  same  size  and/or  are  not  aligned. 

•  Differences  in  language,  font  size  and  other  textual  attributes. 

Figure 5.2: Examples of difficult cases for document page decomposition.


See  Figure  5.3,  for  example,  where  despite  the  fairly  simple  page  layout, 
the  projection  profile  based  system  has  difficulty  because  no  line  across  the 
page  can  separate  text  and  image  regions.  Any  of  these  problems  may  cause 
failure  of  the  previously  described  techniques,  and  it  is  not  uncommon  to  see 
combinations  of  the  cases  described  above,  on  the  same  page.  Based  on  these 
observations,  a  texture-based  segmentation  method  for  extracting  text  has  been 
suggested  by  Jain  et  al.  [45].  The  approach  uses  multi-channel  Gabor  filters 

Figure 5.3: An example of a document with simple layout for which
projection  profile  based  methods  fail. 

as  the  input  features  of  a  classifier  whose  outputs  are  directly  used  to  identify 
text.  This  method  is  computationally  expensive  and  does  not  provide  a  means 
for  incorporating  context  information. 

In  this  chapter,  text,  image  and  graphics  regions  in  a  document  image  are 
described  as  three  classes  of  textures.  The  idea  can  be  justified  by  the  fact  that 
humans  can  identify  document  objects  easily  even  from  low-resolution  images  or 
from distant views of a document page. This shows that the physical segmentation
of a document is not detail- or content-sensitive and, like texture segmentation,
is a low-level vision process. Given the following considerations, some of
the  existing  texture  segmentation  techniques  [38,  43,  17]  can  be  modified  and 
used to identify these regions on the page. One distinctive feature of this task,
compared to texture segmentation problems, is that there are large inter-class,
as  well  as  intra-class,  variations  in  the  textural  features.  Text  and  graphics  are 
texturally  quite  different,  but  different  images  may  also  contain  scenes  of  signifi¬ 
cantly  different  textural  structure,  as  is  the  case  for  texts  of  different  fonts,  sizes, 
and  even  languages.  The  variabilities  are  even  more  pronounced  for  graphics. 

An  important  observation  about  the  human  audio/visual  recognition  system, 
which  is  the  backbone  of  most  artificial  neural  network  models,  is  the  improved 
recognition  power  gained  through  interactions  of  simple  computational  units. 
Each  local  process  can  be  as  simple  as  projection  or  filtering,  passing  through 
simple  non-linearities,  etc.  In  addition,  all  decisions,  at  least  in  low-level  vision, 
are non-binary, highly context-dependent, and based on multi-scale representations
of the input signals/images [83, 93]. With these motivations we search for
consistent multi-scale context-dependent schemes based on soft local decisions.

Our  method  is  based  on  the  fact  that  there  is  some  uncertainty  associated 
with  the  local  decisions  over  small  windows,  due  to  the  limited  view  of  the  signals 
and/or  to  the  randomness  and  ambiguity  inherent  in  the  problem,  or  even  to  the 
presence  of  multiple  classes,  overlapped  or  adjacent,  in  the  same  window.  Using 
large  windows  is  not  recommended,  because  over  larger  windows  the  signals  are 
highly  non-stationary  and  features  computed  based  on  heterogeneous  micro¬ 
structures  are  less  reliable.  Also,  larger  windows  provide  less  spatial  resolution, 
which  is  of  great  concern  in  segmentation  schemes. 

In  the  document  domain,  image  sub-blocks  may  contain  text,  image  and 
graphic  sub-regions  adjacent  to,  or  overlapping,  one  another  (Figure  5.4).  Such 
situations  may  occur  on  boundaries,  where,  for  example,  text  lines  come  close 
to  or  touch  image  regions,  or  when  major  text  regions  occur  in  an  image  or  on 
a  graph.  In  such  cases  it  is  not  appropriate,  even  for  an  optimally  designed 

Figure  5.4:  Examples  of  image  blocks  for  which  there  is  no  correct 
hard  decision:  (a)  text  overlapped  on  an  image,  (b)  both  text  and 
graphics  in  a  block,  (c)  both  image  and  text  in  a  block. 

classifier, to make hard (binary) local decisions. These cases exhibit an inherent
fuzziness  of  class  membership  which  does  not  come  from  the  noise  or  random¬ 
ness,  and  they  support  our  claim  that  soft  local  decisions  are  more  realistic 
and  efficient.  The  uncertainties  reflected  in  soft  decisions  are  then  reduced  by 
propagating  and  integrating  decisions  made  independently  in  the  neighborhoods 
within  and  across  scales. 

In  this  chapter  we  propose  the  utilization  of  multiscale  representations  in  a 
soft  decision  framework  for  the  task  of  layout-independent  physical  page  segmen¬ 
tation.  In  an  attempt  to  handle  even  the  most  difficult  cases  of  segmentation,  we 
make  few  assumptions  about  the  document’s  textual  and  graphical  attributes 
and  layout  structure.  The  system  is  designed  so  that  as  hypotheses  about  doc¬ 
ument  components  are  generated  and  verified,  more  domain-specific  processing 
may  occur. 

The  organization  of  the  chapter  is  as  follows:  In  Section  5.2  the  pyramidal 
wavelet  transforms  and  their  generalized  form,  wavelet  packets,  are  introduced. 
These  transforms  are  used  to  compute  the  input  feature  vectors  at  different 
scales/resolutions.  In  Section  5.3  we  describe  how  multi-scale  feature  vectors 
are  used  for  “soft  classification”  of  small  windows  and  how  the  “propagation” 
and “integration” of those soft decisions, within and across scales, can improve
the  overall  classification  and  segmentation  performance.  Some  issues  about  in¬ 
corporating  prior  knowledge  of  structure  (or  a  model  of  the  document)  and 
the  notion  of  “biased  voting”  are  then  addressed.  Some  comments  about  the 
post-processing  stages  are  given  in  Section  5.4.  Page  segmentation  experiments, 
showing  the  performance  of  the  method,  are  then  described  in  Section  5.5.  Fi¬ 
nally  the  results  are  discussed  and  some  suggestions  about  possible  variations 
and  future  directions  are  made. 

5.2  WP  Decomposition  of  Document  Pages 

The  fact  that  document  objects  (e.g.  characters  and  lines)  appear  at  multiple 
scales  and  our  belief  that  physical  segmentation  is  a  low-level  vision  process  sim¬ 
ilar  to  texture  analysis  suggest  that  the  use  of  multi-resolution  representations 
is  appropriate.  There  are  several  classes  of  multi-scale  decompositions  that  seem 
to be biologically plausible and that have been successfully employed in modern
signal processing schemes. In this chapter we use wavelet-based decompositions
(Figure 5.5) because they provide perfectly reconstructible decompositions
through  fast  algorithms  [56,  22].  For  a  given  class  of  signals,  wavelet  packets 
can  be  adaptively  designed  to  obtain  compact  representations  that  meet  a  pre¬ 
determined  objective  criterion  [19].  Also  the  perfectly  reconstructible  multiscale 
representation,  employed  in  our  system,  can  be  used  as  part  of  a  multi-scale  doc¬ 
ument  compression  scheme. 

Following  our  discussions  about  efficient  discriminant  feature  extraction  in 
Chapter  2,  we  build  the  tree  in  such  a  way  that  the  spread  of  feature  points  in 
each  class  becomes  smaller  and  at  the  same  time  clusters  become  farther  apart. 
The  feature  vectors  consist  of  central  moments  computed  over  local  windows  on 

Figure 5.5: An example of a pyramidal WT on a document page:
the original image (left); the wavelet decomposition (right).

Figure 5.6: Clusters in the feature space may overlap; the three
clusters shown correspond to text, image and graphics subblocks in
the database.

different  subbands. 

Figure  5.6  shows  three  clusters  of  feature  points  corresponding  to  text,  image 
and  graphics  blocks  in  a  database.  The  features  are  extracted  based  on  the 
maximum  class  separability  criterion.  Considering  the  nature  of  mixed  classes, 
overlap of clusters may be inevitable. In the following experiments, a pyramidal
wavelet  transform  and  separability-based  wavelet  packet  trees  are  used. 

5.2.1  Knowledge-based  Post-processing 

In  some  applications  we  may  wish  to  incorporate  constraints  into  our  decision 
based  on  a  priori  or  derived  knowledge  about  the  domain.  For  example,  we 
may  observe  patterns  of  data  that  result  from  rules  subject  to  physical  con¬ 
straints.  In  the  case  of  document  segmentation  into  text,  graphics  and  image 
components,  regions  are  typically  rectangular,  text  symbols  are  arranged  along 
straight  lines,  and  small  graphics  and  images  within  text  regions  are  unlikely. 
For  more  structured  classes  of  documents,  blocks  such  as  the  title,  abstract  and 
page  number  are  expected  to  be  in  specific  regions  of  the  page,  and  even  to  have 
specific  attributes  and  formats.  In  general  these  constraints  are  task-dependent. 

Although  such  constraints  are  often  considered  in  higher  levels  of  processing, 
one  may  also  utilize  them  in  the  early  stages  of  classification  to  get  more  reliable 
results. In the context of the described majority vote method, this idea can
easily be fitted into the system, without increasing its complexity, by a biased
voting scheme. Our expectation about observing a certain class of patterns in a certain
part of the scene is reflected in a biased vote in favor of a particular class over that
region. In this case the system does not start from an all-zero vote matrix $V_{tot}$,
but at each position a small non-zero initial vote is already given to the class(es)
that have been frequently observed in that location. The “biased voting” can
be  viewed  as  not  starting  from  the  middle  of  the  fuzzy  decision  cube  (i.e.  the 
most  fuzzy  point),  but  deviating  from  it  in  favor  of  one  of  the  classes  (in  the 
corners);  see  Figure  5.7. 
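The effect of starting from a non-zero vote matrix can be sketched as below. The array layout, the `bias_weight` parameter and the function name are assumptions for illustration; the sketch simply sums soft decisions per macro-pixel and is not the complete integration scheme of this chapter.

```python
import numpy as np

def integrate_votes(soft_decisions, prior=None, bias_weight=0.1):
    """Sum soft decisions of shape (n_votes, H, W, n_classes) into a vote matrix.

    If a prior of shape (H, W, n_classes) is given, the vote matrix starts
    from bias_weight * prior instead of from all zeros.
    """
    soft_decisions = np.asarray(soft_decisions, dtype=float)
    v_tot = np.zeros(soft_decisions.shape[1:])
    if prior is not None:
        v_tot += bias_weight * np.asarray(prior, dtype=float)
    v_tot += soft_decisions.sum(axis=0)
    return v_tot, v_tot.argmax(axis=-1)

# One macro-pixel, three classes, two ambiguous soft decisions.
decisions = np.array([[[[0.4, 0.4, 0.2]]],
                      [[[0.3, 0.3, 0.4]]]])
prior = np.array([[[0.0, 1.0, 0.0]]])   # class 1 is usually seen here

_, labels_unbiased = integrate_votes(decisions)
_, labels_biased = integrate_votes(decisions, prior)
```

Without the bias, classes 0 and 1 tie and the first one is returned; the small initial vote resolves the tie in favor of the class that is expected at that location.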

The  prior  vote  or  decision  bias  for  each  macro-pixel  can  be  computed  from 

Figure  5.7:  Fuzzy  decision  square:  (a)  when  we  make  a  one-shot 
decision;  (b)  when  a  weighted  sum  of  soft  decisions  is  used.  A  final 
decision  outside  the  gray  area  has  a  low  level  of  confidence. 

the empirical distributions of document objects in the labeled documents of the
training set $\Gamma$:

$$\forall \omega \in \Omega: \quad \Pr\{C = c \mid \omega\} \approx \frac{1}{|\Gamma|} \sum_{\gamma \in \Gamma} I\{L(\gamma, \omega) = c\} \quad \text{for } c = 1, 2, \ldots, L \qquad (5.2.1)$$

where $L(\gamma, \omega)$ is the label of macro-pixel $\omega$ in document $\gamma$, derived from
ground truth data, $L$ is the number of classes, and $I\{\cdot\}$ is the indicator function.
This initial vote can be used if the type or class of document is already determined
by other means or is known a priori.
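Estimating this prior amounts to counting, for each macro-pixel, how often each class label occurs across the labeled training pages. The sketch below assumes that all label maps share one macro-pixel grid and that classes are numbered from 0 (the text numbers them from 1); the function name is hypothetical.

```python
import numpy as np

def empirical_prior(label_maps, n_classes):
    """Per-position class frequencies from label maps of shape (n_docs, H, W).

    Returns an array of shape (H, W, n_classes) whose last axis sums to 1.
    """
    label_maps = np.asarray(label_maps)
    # Indicator I{label == c} for every document, position and class.
    one_hot = label_maps[..., None] == np.arange(n_classes)
    return one_hot.mean(axis=0)

# Four toy "documents" with a 1 x 2 macro-pixel grid and three classes.
maps = np.array([[[0, 1]],
                 [[0, 1]],
                 [[1, 1]],
                 [[2, 1]]])
prior = empirical_prior(maps, n_classes=3)
```

The first position has seen class 0 in half of the documents, while the second position has always been class 1, so its prior vote is fully concentrated on that class.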

The  spatial  patterns  of  combined  soft  decisions  in  vote  matrices  directly  re¬ 
flect  the  locations,  shapes  and  classes  of  major  blocks.  In  some  cases,  however, 
obtaining  a  more  precise  segmentation  requires  some  knowledge-based  post¬ 
processing  to  incorporate  additional  knowledge  about  the  structure.  In  such 
cases the spatial pattern of votes also provides a convenient starting point to
apply constraints and perform further analysis. The local nature of texture-based
segmentation of documents sometimes results in sparse misclassified regions.
For example, the textural characteristics of the leaves on a tree in the
image part of a page may locally resemble, and therefore be classified as, text
elements.  Some  of  these  sparse  mis-classifications  are  corrected  after  the  de¬ 
cision  integration  process,  but  in  some  cases  higher-level  and  knowledge-based 
post-processing  may  be  needed.  Certain  rules  and  constraints  can  be  consid¬ 
ered,  with  different  ranges  of  generality  and  therefore  applicability.  These  rules 
put  restrictions  on  the  absolute  locations  of  document  objects  on  the  page  and 
their  positions  relative  to  each  other.  A  stroke  embedded  in  a  text  region  should 
be  interpreted  as  a  character,  but  the  exact  same  pattern  in  the  margin  should 
be  interpreted  as  noise.  Similarly,  one  might  hypothesize  that  “small”  blobs 
labeled  as  graphics,  as  well  as  small  text-like  regions  in  an  image  block  with 
no  collinearity  between  them,  have  been  misclassified  and  therefore  can  be  re¬ 
labeled.  Depending  on  how  likely  they  are  to  be  encountered  in  an  application, 
one  can  put  restrictions  on  the  shapes  and  minimum  sizes  of  the  labeled  regions. 
For  example,  one  may  make  use  of  the  fact  that  text  elements  tend  to  be  orga¬ 
nized  into  lines,  and  are  typically  left- justified  in  groups  that  fall  into  columns 
of  rectangular  shape.  Although  restrictive,  assumptions  about  the  minimum 
and  maximum  character  sizes,  as  well  as  minimum  sizes  of  image  and  graphical 
objects,  may  be  derived  and  utilized. 

By  applying  these  structural  rules,  one  can  hypothesize  and  remove  logically 
undesirable  gaps  and  noise-like  small  blobs  of  misclassified  regions.  Similarly 
one  may  complete  and  rectify  region  boundaries  and  fit  them  with  polygons  or 
rectangles,  to  obtain  a  parametric  layout  representation  consistent  with  derived 
knowledge  of  the  domain. 

As an example, and without going into the details of imposing layout-specific
constraints in our experiments, we establish a structural hierarchy. It suggests
that a text region is typically uniform and contains no graphics or image
component, but a graphic or image region may have subordinate text. The text
must, however, have a uniform background and extend over a relatively large
area with respect to the font size. These properties are enforced by applying
morphological operations of each type to filter out noise which is too small to
constitute a document component. Constraints on the sizes of regions, as
described in Section 5.1, are implemented using 3×3 morphological kernels. Given
a constraint on the size of a text region, we perform a closing operation on the
image, to eliminate regions which appear locally as text. For the image
resolution and window size used in the experiments described in the next section,
a six-step closing operation eliminates a majority of the noise regions. Since the
extent of a six-step closing operation is approximately the size of a capital “M”
in a 9pt font, we do not have to be concerned that larger text regions will be
eliminated. For text which actually appears as part of the image, higher-level
constraints must be used, if possible, to associate the text with the image.
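A multi-step closing with a 3×3 kernel can be sketched as follows. The sketch rests on assumptions that the text does not spell out: the operation is applied to a binary mask of one class (for instance the image class, so that small text-labeled holes inside it are filled), an n-step closing is read as n dilations followed by n erosions, and pixels outside the mask are simply ignored. The function names are hypothetical.

```python
def _morph(mask, want_all):
    """One 3x3 step: erosion if want_all is True, dilation otherwise."""
    rows, cols = len(mask), len(mask[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            # Collect the in-bounds 3x3 neighbourhood of (r, c).
            neigh = [mask[i][j]
                     for i in range(max(0, r - 1), min(rows, r + 2))
                     for j in range(max(0, c - 1), min(cols, c + 2))]
            out_row.append(int(all(neigh)) if want_all else int(any(neigh)))
        out.append(out_row)
    return out


def closing(mask, steps):
    """Apply `steps` 3x3 dilations followed by `steps` 3x3 erosions."""
    for _ in range(steps):
        mask = _morph(mask, want_all=False)
    for _ in range(steps):
        mask = _morph(mask, want_all=True)
    return mask
```

With this reading, a hole is filled only if the dilations close it completely; a hole wider than roughly twice the number of steps survives unchanged, which matches the size-based argument above.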

5.3  Experiments 

To show the effectiveness of the suggested soft decision integration method, we
have applied it to document page segmentation.

5.3.1  Input  Representation  and  Training  Set 

In the following experiments both wavelet transform and wavelet packet
decompositions are used as input signal representations. In the first two examples,
features are computed from a two-level wavelet transform. At each level, only
detail subbands are used and there is one classifier for each scale. The results
of classification at the two scales are combined as described in Section 3.3. In
the other examples, features are selected using a separability measure on the
wavelet packet decomposition. For these experiments, the six features that contain
the highest classification information, based on the previously used separability
measure, have been selected and used.

The input data consist of several gray-scale document pages scanned at 200
dpi, and the input features are the second and third central moments ($\mu_2$ and
$\mu_3$) of the image subbands computed over small windows W on the decomposed
image.
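The two window features can be computed as in the following sketch. It assumes a subband stored as a list of rows and a square window given by its top-left corner and side length; the function name and this calling convention are illustrative only.

```python
def central_moments(subband, top, left, size):
    """Return (mu2, mu3) of the size x size window at (top, left)."""
    values = [subband[r][c]
              for r in range(top, top + size)
              for c in range(left, left + size)]
    mean = sum(values) / len(values)
    # Second central moment (variance) and third central moment (skew term).
    mu2 = sum((v - mean) ** 2 for v in values) / len(values)
    mu3 = sum((v - mean) ** 3 for v in values) / len(values)
    return mu2, mu3
```

For the 2×2 window `[[0, 0], [0, 4]]` the mean is 1, so the call returns `(3.0, 6.0)`.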

The  training  set  consists  of  about  200  samples  from  each  of  the  text,  image 
and  graphics  sub-blocks.  These  16  x  16  pixel  sub-blocks  are  extracted  randomly 
from  several  document  pages.  In  order  to  avoid  over-training,  a  “validation” 
set  is  used  to  test  the  performance  of  the  network,  after  every  ten  iterations, 
during  the  training  stage.  As  training  proceeds,  errors  on  both  the  training 
and  validation  sets  decrease.  Training  is  suspended  as  soon  as  the  error  in 
the  validation  set  starts  increasing.  If  the  desired  performance  is  achieved,  the 
process  stops;  otherwise,  part  of  the  validation  set  is  included  in  the  training  set 
and  training  proceeds  on  the  augmented  training  set. 
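The stopping rule can be sketched as a small loop. The two callables are placeholders for the actual network update and validation pass, and the name `train_with_early_stopping` is hypothetical; only the check every ten iterations and the rule of stopping when the validation error rises are taken from the description above.

```python
def train_with_early_stopping(train_step, validation_error,
                              check_every=10, max_iters=1000):
    """Train until the validation error starts increasing.

    train_step: callable performing one training iteration.
    validation_error: callable returning the current validation error.
    Returns (iterations performed, lowest validation error seen).
    """
    best_error = float("inf")
    iters_done = 0
    while iters_done < max_iters:
        for _ in range(check_every):
            train_step()
        iters_done += check_every
        err = validation_error()
        if err > best_error:
            break  # validation error went up: suspend training
        best_error = err
    return iters_done, best_error
```

If the returned error is still above the desired level, part of the validation set would be moved into the training set and the loop run again, as described above.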

5.3.2  Network  Description  and  Training 

In  all  of  the  experiments,  multi-layer  feed-forward  neural  networks  are  used 
as  the  soft  classifiers.  The  network  consists  of  six  input,  eight  hidden,  and 
three  output  units.  The  input  units  are  linear,  whereas  the  hidden  and  output 
units  have  sigmoid  nonlinearities.  A  conjugate  gradient  method  is  used  for  fast 
convergence  of  the  supervised  learning  algorithm  [88]. 
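A forward pass through such a 6-8-3 network can be sketched as below. The weight layout (`weights[j][i]` connecting input i to unit j) and the presence of bias terms are assumptions made for the illustration; training by conjugate gradient is not shown.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights, biases):
    """Sigmoid layer: weights[j][i] connects input i to unit j."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def forward(features, w_hidden, b_hidden, w_out, b_out):
    """Map 6 input features to 3 soft decisions in (0, 1)."""
    hidden = layer(features, w_hidden, b_hidden)   # 8 hidden units
    return layer(hidden, w_out, b_out)             # 3 output units
```

Because the output units are sigmoids, each of the three values lies strictly between 0 and 1 and can be read as a soft decision for its class.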

The three outputs correspond to text, image, and non-text non-image classes.
In other words, any sub-block not identified as text or image is considered
“graphics”. Blank regions are detected separately in a straightforward way. The
outputs can take values in [0, 1] and the network is trained in such a way that
these outputs provide soft non-binary decisions about the class memberships of
the input image blocks. This is essential because, as mentioned above, small
regions may locally resemble more than one class, or the image sub-block may
be composed of text, image or graphics subregions. In such cases, during
training, the outputs corresponding to text, graphics and images are required to take
target values roughly in proportion to the fraction of block area they occupy.
Including such composite blocks in the training set results in better performance
on the boundaries. If a decision integration stage is used, the result will be much
less sensitive to these adjustments.

Despite its significance, the effect of a suitable output representation is
sometimes overlooked. In fact, in some cases, such as the design and training of soft
decision based classifiers, the choice of output representation can be as
important as that of input representation. In this experiment, in order to provide
the learning algorithm with a consistent set of input-output pairs, the following
procedure has been implemented. Assuming that the data in the training set
is labeled correctly and consistently, for any macro-pixel ω and any class c one
can compute the desired soft decision for class membership as

$$\forall \omega \in \Omega: \quad \frac{1}{|W|} \sum_{x \in W} I(\mathrm{Lab}(x) = c) \qquad (5.3.2)$$

i.e., the relative number of pixels in the window labeled as c. This form of target
value computation is consistent with our assumption about spatial relevance. It
is also a suitable means of determining soft local decisions when mixed classes are
present in the window, e.g. overlapped and adjacent text and image components
in the area covered by W. These labeled examples are the basis for learning the
fuzzy membership functions in our multidimensional feature space.
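The target computation amounts to counting labels inside the window. A minimal sketch, assuming the ground truth for a window is available as a small grid of per-pixel class labels (the function name and the string labels are illustrative):

```python
def soft_targets(label_window, classes):
    """Fraction of pixels in the window carrying each class label."""
    labels = [lab for row in label_window for lab in row]
    return {c: labels.count(c) / len(labels) for c in classes}
```

For a window that is half text, one quarter image and one quarter graphics, the targets are 0.5, 0.25 and 0.25 respectively, so a block straddling a boundary is never forced into a single hard class during training.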
5.4 Results and Discussion


The  decision  integration  scheme  described  in  this  chapter  has  been  used  to 
identify  text,  images,  and  graphics  regions.  We  have  tested  our  approach  on 
a  number  of  document  images  which  are  difficult  for  other  approaches  due  to 
multiple  scales  and  complex  (but  not  unusual)  layout  of  the  components. 

Figure  5.8:  For  this  example  feature  vectors  are  computed  from  the  wavelet 
transform  with  two  levels  of  decomposition.  The  example  shows  the  advantage 
of  using  decision  integration  in  identifying  major  document  blocks.  In  this  image 
we  have  a  skewed  page  with  multiple  columns  where  the  text  lines  of  the  different 
columns  are  not  aligned,  the  image  is  surrounded  by  text,  and  there  are  two  text 
fonts/sizes  on  the  page. 

Figure  5.9:  This  is  the  same  example  that  was  shown  in  Figure  5.3.  In 
this  test  we  have  used  only  two  features  extracted  from  the  wavelet  packet 
decomposition  of  the  document  images  in  our  database;  the  features  are  selected 
based  on  the  aforementioned  separability  criterion.  This  is  an  example  of  a  page 
with  different  font  sizes  and  non-rectangular  object  boundaries. 

Figure 5.10: This example shows the effectiveness of the suggested scheme
for cases where image and text regions are very close to each other and regions
do not have rectangular or even convex boundaries. For this example the
results of prescribed post-processing based on morphological operations are also
illustrated.

Figure 5.11: This example shows a very difficult scenario where text is
embedded in the image, i.e. where different classes of objects are overlapped.
Even in this case our method provides good results.

Figure 5.8: Page segmentation results for a document image. (From
left to right): original image, two-level wavelet decomposition,
segmentation without decision integration, segmentation with decision
integration. Dark gray and light gray represent image and text areas
respectively.

Figure 5.9: Page segmentation results for a complete page, with
multiple font sizes.

Figure 5.10: An example of a difficult segmentation: irregular
and non-convex image boundary very close to text. (a) Original
image; (b) segmentation without post-processing; (c) result after
post-processing; (d) final segmentation.

Figure 5.11: An example of a difficult segmentation: text embedded
in an image.

5.5 Conclusions


Our experiments have shown that good page segmentation results can be
obtained using context information through propagation and integration of soft
decisions based on multiscale features. The improvement resulting from
decision integration is significant when confident hard local decisions cannot be
made because of poor features, poor resolution, windowing considerations, noise
and/or the inherent fuzziness of the classification task. Very good performance
has been obtained on complex document layouts, using simple feature sets and
classifiers. A majority of the calculations and decisions are made independently
and in parallel without any iterative stages. They are therefore well adapted to
distributed and parallel algorithms and architectures, which promise robust and
fast implementation.

As mentioned earlier, the physical segmentation process is typically part of a
larger system and, depending on the application, it may be followed by a
functional decomposition module in a document understanding or source encoder
system. This work can be extended in a number of directions. For example,
one may estimate the text font and text line orientation for all blocks labeled
as text, in order to prepare them for OCR algorithms. Also, one can find
parametric representations of labeled regions referenced to the page so that their
logical identities can be defined or searched for in a database. Other feature
vectors such as multi-channel Gabor filters, co-occurrence matrices, or sets of
document-specific features, such as black pixel densities or black/white
transitions, can be explored to produce accurate local decisions. The basic idea of
incorporating context information through integrating soft local decisions can
be applied to other image and signal segmentation tasks.

Chapter 6

Automatic  Face  Recognition 

6.1  Introduction 

Inspired by humans’ ability to recognize faces as special objects, and motivated
by the increased interest in commercial applications of automatic face
recognition as well as the emergence of real-time processors, research on automatic
recognition of faces has become very active. Studies about the analysis of
human facial images have been conducted in various disciplines. These studies
range from psychophysical analysis of human recognition of faces and related
psychovisual tests [5, 23] to research on practical and engineering aspects of
computer recognition/verification of human faces and facial expressions [91] or
race/gender classification [9, 35].

The problem of Automatic Face Recognition (AFR) is a composite task that
involves detection and location of faces in a cluttered background, facial feature
extraction, subject identification, and verification [74, 15]. Depending on the
nature of the application, e.g. image acquisition conditions, size of database,
clutter and variability of the background/foreground, noise, occlusion, and finally
cost and speed requirements, some of the subtasks become more challenging
than others.

Detection of a face or group of faces in a single image or a sequence of images,
which has applications in face recognition as well as video conferencing systems,
is a challenging task and has been studied by many researchers [40, 15, 92]. Once
the face image is extracted from the scene, its gray level and size are usually
normalized before storing or testing. In some applications, such as identification
of passport pictures or mug-shots, conditions of image acquisition are usually so
controlled that some of the preprocessing stages may not be necessary.

One  of  the  most  important  components  of  an  AFR  system  is  the  extraction 
of  facial  features,  which  attempts  to  find  the  most  appropriate  representation  of 
face  images  for  identification  purposes.  The  main  challenge  in  feature  extraction 
is to represent the input data in a low-dimensional feature space in which points
corresponding  to  different  poses  of  the  same  subject  are  “close”  to  each  other  and 
“far”  from  points  corresponding  to  instances  of  other  subjects’  faces.  However, 
there  is  a  lot  of  within-class/subject  variation  due  to  differing  facial  expressions, 
head  orientations,  lighting  conditions,  etc.,  which  makes  the  task  more  complex. 

Closely  tied  to  the  task  of  feature  extraction  is  the  intelligent  and  sensible 
definition  of  similarity  between  test  and  known  patterns.  The  task  of  finding  a 
relevant  distance  measure  in  the  selected  feature  space,  and  thereby  effectively 
utilizing  the  embedded  information  to  accurately  identify  human  subjects,  is 
one  of  the  main  challenges  in  face  identification.  In  this  chapter  we  focus  on  the 
feature  extraction  and  face  identification  processes. 

Typically,  each  face  is  represented  using  a  set  of  gray-scale  images/templates, 
a  small-dimensional  feature  vector,  or  a  graph.  There  are  also  various  proposals 
for  recognition  schemes  based  on  face  profiles  [90]  and  isodensity  or  depth  maps 
[36, 60]. There are two major approaches to facial feature extraction for
recognition in computer vision research: holistic template matching based systems,
and geometrical local feature based schemes and their variations [15].

In holistic template matching systems each template is a prototype face
or face-like gray-scale image or an abstract reduced-dimensional feature
vector which has been obtained through processing the face image as a whole.
Low-dimensional representations are highly desirable for large databases, fast
adaptation, and good generalization. Based on these needs, studies have been
performed on the minimum acceptable image size and the smallest number
of gray levels required for good recognition results [74]. Reduction in
dimensionality can also be achieved using various data compression schemes. For example,
representations based on Principal Component Analysis (PCA) [18, 48, 80, 66]
and Singular Value Decomposition (SVD) [77] have been studied and extensively
used for various applications. It has also been shown that the nonlinear mapping
capability of multilayer neural networks can be utilized and the internal/hidden
representations of face patterns which, typically, are of much lower
dimensionality than the original image, can be used for race/gender classification [9, 35].
Some of the most successful AFR schemes are based on the Karhunen-Loeve
Transform (KLT) [48, 66], yielding so-called eigenfaces. In these methods the
set of all face images is considered as a vector space and the eigenfaces are
simply the top principal components of this “face space”; they are computed as
eigenvectors of the covariance matrix of the data.
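To make the last statement concrete, the following sketch extracts only the leading eigenvector of the sample covariance matrix. Power iteration is used here purely to keep the example self-contained; it is a substitute chosen for illustration, not the procedure of the cited eigenface work, and the name `top_eigenface` is hypothetical. Faces are assumed to be given as flat lists of pixel values.

```python
import math


def top_eigenface(faces, iterations=200):
    """Return the unit-norm leading eigenvector of the sample covariance.

    Degenerate input (all faces identical) is not handled.
    """
    n, dim = len(faces), len(faces[0])
    mean = [sum(f[i] for f in faces) / n for i in range(dim)]
    centered = [[f[i] - mean[i] for i in range(dim)] for f in faces]
    v = [1.0] * dim
    for _ in range(iterations):
        # Multiply by the covariance without forming it: C v = X^T (X v) / n
        proj = [sum(x[i] * v[i] for i in range(dim)) for x in centered]
        v = [sum(p * x[i] for p, x in zip(proj, centered)) / n
             for i in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
    return v
```

Note that class labels never enter this computation: the direction found depends only on where the pooled data varies most.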

In geometrical feature-based systems one attempts to locate major face
components or feature points in the image [21, 58, 69, 76]. The relative sizes of and
distances between the major face components are then computed. The set of
normalized size and distance measurements constitutes the final feature
vector for classification. One can also use the information contained in the feature
points to form a geometrical graph representation of the face that directly shows
the sizes and relative locations of major face attributes [58]. Most geometrical
feature-based systems involve several steps of window-based local processing,
followed by iterative search algorithms, to locate the feature points. These methods
are more adaptable to large variations in scale, size and location of the face in
an image but are more susceptible to errors when face details are occluded,
e.g. by glasses or facial hair, or altered by facial expressions or variations
in head orientation. Compared to template/PCA based systems, these methods
are computationally more expensive. Comparative studies of template versus
local feature-based systems can be found in [15, 9, 67]. There are also
various hybrid schemes that apply the KLT and/or template matching idea to face
components and use correlation-based search to locate and identify facial feature
points [9, 66]. The advantage of performing component-by-component matching
is improved robustness against head orientation changes, but its disadvantage
is the complexity of searching for and locating the face components.

The human audio/visual system, as a powerful recognition model, takes great
advantage of context and auxiliary information. Inspired by this observation, one
can devise schemes that consistently incorporate context and collateral
information, when and if they become available, to enhance their final decisions.
Incorporating information such as race, age and gender, obtained through
independent analysis, improves recognition results [66]. Also, since face recognition
involves a classification problem with large within-class variations, caused by
dramatic image variations in different poses of the subject, one has to devise
methods of reducing or compensating for such variability, for example:

1.  For  each  subject  store  several  templates,  one  for  each  major  distinct  facial 
expression  and/or  head  orientation.  Such  systems  are  typically  referred  to 
as  view-based  systems. 

2.  Use  deformable  templates  along  with  a  3-D  model  of  a  human  face  to 
synthesize  virtual  poses  and  apply  the  template  matching  algorithm  to 
the  synthesized  representations  [94]. 

3. Incorporate such variations in the process of feature extraction.

In our experiments, we take the third approach and keep the first method
as  an  optional  stage  that  can  be  employed  depending  on  the  complexity  of  the 
specific  task.  Our  approach  is  to  use  holistic  LDA-based  feature  extraction 
for  human  faces  followed  by  evidential  soft  decision  integration  for  multisource 
data  analysis.  This  method  is  a  projection-based  scheme  of  low  complexity 
that  avoids  any  iterative  search  or  computation.  In  this  method  both  off-line 
feature  extraction  and  on-line  feature  computation  can  be  done  at  high  speeds 
and  recognition  can  be  done  almost  in  real  time.  Our  experimental  results 
show  that  very  reliable  recognition  performance  can  be  achieved  with  very  low 
complexity  and  small  numbers  of  features. 

The organization of this chapter is as follows. In Section 6.2, we provide an
objective study of multi-scale features of face images in terms of their
discriminating power. In Section 6.3 we propose a holistic method of projection-based
discriminant facial feature extraction through LDA of face images. We also
make a comparative study of the features obtained using the proposed scheme
and the ones employed in compression-based methods such as PCA/KLT. In
Section 6.4 we address the task of classification/matching through multi-source
data analysis and combining soft decisions from multiple imprecise information
sources. Finally, based on the reliability of the basic decisions, we propose a
task-dependent measure of similarity in the feature space, to be used at the
identification stage. All the experiments in this chapter are based on the
application of LDA to the original image, but the ideas can be extended to multiscale
representations.

6.2 Linear Discriminant Analysis of Facial Images

As highly structured 2-D patterns, human face images can be analyzed in the
spatial and/or the frequency domain. These patterns are composed of
components that are easily recognized at high levels but are loosely defined at low
levels of our visual system [59, 23]. Each of the facial components/features has
a different discriminatory power for identifying a person or the person’s gender,
race or age. There have been many studies of the significance of such features
using subjective psychovisual experiments [5, 23].

Using objective measures, in this section we propose a computational scheme
for evaluating the significance of different facial attributes in terms of their
discriminatory potential. The results of this analysis can be supported by
subjective psychovisual findings. To analyze any representation V, where V can be
the original image, its spatial segments, or transformed images, we provide the
following framework.

First,  we  need  a  training  set  composed  of  a  relatively  large  group  of  subjects 
with  diverse  facial  characteristics.  The  appropriate  selection  of  the  training  set 
directly  determines  the  validity  of  the  final  results.  The  database  should  contain 
several  examples  of  face  images  for  each  subject  in  the  training  set  and  at  least 
one  example  in  the  test  set.  These  examples  should  represent  different  frontal 
views  of  subjects  with  minor  variations  in  view  angle.  They  should  also  include 
different  facial  expressions,  lighting  and  background  conditions,  and  examples 
with and without glasses. It is assumed that all images are already normalized
to m × n arrays and that they contain only the face regions and not much of the
subjects’ bodies.

Second, for each image/subimage, starting with the two-dimensional m × n
array of intensity values I(x, y), we construct the lexicographic vector expansion
$\phi \in \mathbb{R}^{mn}$. This vector corresponds to our initial representation of the face.
Thus, the set of all faces in the feature space is treated as a high-dimensional
vector space.

Third, by defining all instances of the same person’s face as being in one class
and the faces of different subjects as being in different classes, for all subjects
in the training set, we establish a framework for performing a cluster separation
analysis in the feature space. Also, having labeled all instances in the training
set and having defined all the classes, we compute the within- and between-class
scatter matrices, i.e. $S_w$ and $S_b$ respectively. Then we can use any of the class
separability measures of Chapter 2. For example,

$$J_V = \mathrm{Sep}(V) = \mathrm{tr}(S^{(V)}) \qquad (6.2.1)$$

$$J_V = \mathrm{tr}(S_b)/\mathrm{tr}(S_w) \qquad (6.2.2)$$

can be considered. In this test $J_V$ is our measure of the Discriminatory
Power (DP) of a given representation V. As mentioned above, the representation
may correspond to the data in its original form (e.g. a gray-scale image), or it
can be based on a set of abstract features computed for a specific task.
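The trace-ratio measure can be evaluated without forming the scatter matrices, since the trace of each one is a sum of squared distances. The sketch below assumes the common unnormalized definitions, in which the within-class scatter sums outer products of deviations from the class mean and the between-class scatter weights each class mean by its sample count; the definitions in Chapter 2 may differ by a normalization. The function name is hypothetical.

```python
def discriminatory_power(samples, labels):
    """Return tr(S_b) / tr(S_w) for feature vectors with class labels.

    Assumes at least one class has within-class variation (tr(S_w) > 0).
    """
    dim = len(samples[0])

    def mean(vectors):
        return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    grand_mean = mean(samples)
    tr_sw = tr_sb = 0.0
    for c in set(labels):
        members = [s for s, lab in zip(samples, labels) if lab == c]
        class_mean = mean(members)
        # tr(S_w): squared distances of samples to their own class mean.
        tr_sw += sum(sq_dist(s, class_mean) for s in members)
        # tr(S_b): squared distance of the class mean to the grand mean.
        tr_sb += len(members) * sq_dist(class_mean, grand_mean)
    return tr_sb / tr_sw
```

Applying the same function to the lexicographic vectors of different segments or subbands gives directly comparable scores, provided the same subjects and images are used for each representation.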

For example, through this analysis we are able to compare the DP’s of
different spatial segments/components of a face. We can apply the analysis to
segments of the face images such as the areas around the eyes, mouth, hair,
chin, or combinations of them. Figure 6.1 shows a separation analysis for
horizontal segments of the face images in the database. The results show that the
DP’s of all segments are comparable, and that the area between the nose and the
mouth has more identification information than other parts. Figure 6.2 shows
that the DP of the whole image is significantly larger than the DP’s of its parts.

Figure 6.1: Variation of the discriminatory power of horizontal
segments of the face defined by a window of fixed height sliding from
top to bottom of the image.


Figure  6.2:  Variation  of  the  discriminatory  power  of  a  horizontal 
segment  of  the  face  that  grows  in  height  from  the  top  to  the  bottom 
of  the  image. 


Using wavelet transforms [19, 56, 22] as multi-scale orthogonal
representations of face images, we can also perform a comparative analysis of the DP’s
of subimages in the wavelet domain. Different components of a wavelet
decomposition capture different visual aspects of a gray-scale image. As Figure
6.3 shows, at each level of decomposition there are four orthogonal subimages
corresponding to

• LL: The smoothed, low-frequency variations.

• LH: Sharp changes in the horizontal direction, i.e. vertical edges.

• HL: Sharp changes in the vertical direction, i.e. horizontal edges.

• HH: Sharp changes in non-horizontal/non-vertical directions, i.e. other
edges.

Figure 6.3: Different components of a wavelet transform, capturing
sharp variations of the image intensity in different directions, have
different discriminatory potentials. The numbers represent the
relative discriminatory power.
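The four subimages of one decomposition level can be illustrated with the Haar wavelet, which is chosen here only because it is the simplest case; the wavelet actually used in the experiments is not stated in this passage. The sketch assumes an image with even side lengths stored as a list of rows, and follows the subband naming of the list above.

```python
def haar_level(image):
    """One level of a 2-D Haar transform on an image with even sides."""
    ll, lh, hl, hh = [], [], [], []
    for r in range(0, len(image), 2):
        rows = ([], [], [], [])
        for c in range(0, len(image[0]), 2):
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            rows[0].append((a + b + d + e) / 2)   # LL: smoothed image
            rows[1].append((a - b + d - e) / 2)   # LH: horizontal changes
            rows[2].append((a + b - d - e) / 2)   # HL: vertical changes
            rows[3].append((a - b - d + e) / 2)   # HH: diagonal changes
        for band, row in zip((ll, lh, hl, hh), rows):
            band.append(row)
    return ll, lh, hl, hh
```

A block containing a vertical edge, such as `[[1, 0], [1, 0]]`, responds only in LL and LH, while a horizontal edge responds only in LL and HL. Each returned subimage can then be vectorized and scored separately.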

We applied LDA to each subimage of the wavelet transform (WT) of the face and
estimated the discriminatory power of each subband. Figure 6.3 compares the
separations obtained using each of the subbands. Despite their equal sizes,
different subimages carry different amounts of information for classification;
the low-resolution component is the most informative. The horizontal edge
patterns are almost as important as the vertical edge patterns, and their
relative importance depends on the scale. Finally, the least important
component in terms of face discrimination is the fourth subband, i.e. the
slanted edge patterns. These results are consistent with our intuition and also
with subjective psychovisual experiments.

One can also apply this idea to study the importance of facial components
for gender or race classification from images.


6.3  Discriminant  Eigenfeatures  for  Face  Recognition 

In  this  section  we  propose  a  new  algorithm  for  face  recognition  that  makes  use  of 
a  small,  yet  efficient,  set  of  discriminant  eigentemplates.  The  analysis  is  similar 
to the method suggested by Pentland et al. [66, 80], which is based on PCA and
KLT.  The  fundamental  difference  is  that  in  our  system  eigenvalue  analysis  is 
performed  on  the  separation  matrix  rather  than  the  covariance  matrix. 

Human  face  images  as  two-dimensional  patterns  have  a  lot  in  common  and 
are  spectrally  very  similar.  Therefore,  considering  the  face  image  as  a  whole,  one 
expects  to  see  important  discriminant  features  that  have  low  energies.  These 
low-energy  discriminant  features  may  not  be  captured  in  a  compression-based 
feature  extraction  scheme  like  PCA,  or  even  in  multi-layer  neural  networks, 
which  rely  on  minimization  of  average  Euclidean  error.  In  fact,  there  is  no 
guarantee  that  the  error  incurred  by  applying  the  compression  scheme,  despite 
its  low  energy,  does  not  carry  significant  discriminatory  information.  Also,  there 
is  no  reason  to  believe  that  for  a  given  compression-based  feature  space,  feature 
points  corresponding  to  different  poses  of  the  same  subject  will  be  closer  (in 
Euclidean  distance)  to  each  other  than  to  those  of  other  subjects.  In  fact  it  has 
been  argued  and  experimentally  shown  that  ignoring  the  first  few  eigenvectors, 
corresponding  to  the  top  principal  components,  can  lead  to  a  substantial  increase 
in  recognition  accuracy  [66,  63].  Therefore  the  secondary  selection  from  the  PCA 
vectors  is  based  on  their  discriminatory  power.  But  one  could  ask,  why  do  we 
not  start  with  a  criterion  based  on  discrimination  rather  than  representation 
from  the  beginning,  to  make  the  whole  process  more  consistent? 

The KLT/PCA approach provides us with features that capture the main
directions along which face images differ the most, but it does not attempt to
reduce  the  within-class  scatter  of  the  feature  points.  In  other  words,  since  no 
class  membership  information  is  utilized,  examples  of  the  same  class  or  different 
classes  are  treated  in  the  same  way.  LDA,  however,  uses  the  class  membership 
information  and  allows  us  to  find  eigenfeatures  and  therefore  representations  in 
which  the  variations  among  different  faces  are  emphasized,  while  the  variations 
of  the  same  face  due  to  illumination  conditions,  facial  expression,  orientation 
etc.  are  de-emphasized. 

According to this observation, and based on the results that follow, we believe
that for classification purposes LDA-based feature extraction is an
appropriate and logical alternative to PCA, KLT, or any other compression-
based system which tries to find the most compact representation of face images.
Concurrently, but independently of our studies, LDA has been used by Swets and
Weng [78, 79] to discriminate human faces from other objects.

In order to capture the inherent symmetry of basic facial features and the
fact that a face can be identified from its mirror image, we can use the mirror
image of each example as a source of information [48]. Also, by adding noisy but
identifiable versions of the given examples, we can expand our training data and
improve the robustness of the feature extraction against small amounts of noise
in the input. Therefore, for each image in the database we include its mirror
image and some of its noisy versions, as shown in Figure 6.4. We thus have

$\Phi = \{\Phi^s : s = 1, 2, \ldots, N_s\}$  (6.3.3)

$\Phi^s = \{\phi_i^s,\ \tilde{\phi}_i^s,\ (\phi_i^s + \nu) : i = 1, 2, \ldots, N_e,\ \nu = [N(0, \sigma^2)]^{n}\}$  (6.3.4)

where $\tilde{\phi}_i^s$ and $\phi_i^s + \nu$ are mirror images and noisy versions of the $i$th example of
subject $s$ in the database $\Phi$, respectively. Also, $N_s$ is the number of subjects and
$N_e$ is the number of examples per subject in the initial database.

Figure 6.4: For each example in the database we add its mirror image
and some noisy versions.

Following our earlier observations, and having determined the separation matrix,
we perform eigenvalue analysis of the separation matrix on the augmented database:

$\mathrm{eig}\{S^{(\Phi)}\} = \{(\lambda_i, u_i) : i = 1, \ldots, N_s - 1,\ \lambda_i > \lambda_{i+1}\}$  (6.3.5)
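
The augmented database on which this eigenanalysis runs could be assembled along the following lines. This is only a minimal sketch: the function name `augment_examples`, the noise level `noise_sigma`, the number of noisy copies `n_noisy`, and the clipping to the 8-bit gray-level range are illustrative assumptions, not values taken from the experiments.

```python
import numpy as np

def augment_examples(images, noise_sigma=5.0, n_noisy=1, seed=0):
    """For each 2-D gray-level image, return the image, its mirror image,
    and n_noisy noisy copies (hypothetical helper, for illustration only)."""
    rng = np.random.default_rng(seed)
    augmented = []
    for img in images:
        img = np.asarray(img, dtype=float)
        augmented.append(img)
        augmented.append(np.fliplr(img))  # left-right mirror image
        for _ in range(n_noisy):
            # i.i.d. zero-mean Gaussian noise with variance noise_sigma**2
            noise = rng.normal(0.0, noise_sigma, size=img.shape)
            # clipping to [0, 255] is an assumption of this sketch
            augmented.append(np.clip(img + noise, 0.0, 255.0))
    return augmented

# Tiny stand-in "image" so the example runs without any face data.
faces = [np.arange(12, dtype=float).reshape(3, 4)]
augmented = augment_examples(faces, noise_sigma=1.0, n_noisy=2)
```

Each original example then contributes `2 + n_noisy` images to the training set.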

Now let $\Lambda^{(m)}$ and $U^{(m)}$ represent the set of the $m$ largest eigenvalues of $S^{(\Phi)}$ and
their corresponding eigenvectors. As discussed in Chapter 2, $U^{(m)}$ minimizes
the drop $|\mathrm{Sep}(X) - \mathrm{Sep}({U^{(m)}}^{T} X)|$ in classification information incurred by the
reduction in the feature space dimensionality, and no other $R^n$ to $R^m$ linear
mapping can provide more separation than $U^{(m)}$ does.

Therefore, the optimal linear transformation from the initial representation
space in $R^n$ to a low-dimensional feature space in $R^m$ based on our selected
separation measure results from projecting the input vectors $\phi$ onto the $m$
eigenvectors corresponding to the $m$ largest eigenvalues of the separation matrix $S^{(\Phi)}$.
These optimal vectors/directions can be obtained from a sufficiently rich training
set and can be updated if needed.
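
The projection step described above can be sketched as follows. The separation matrix itself is defined in Chapter 2 and is not reproduced in this section, so the sketch assumes the standard form $S_w^{-1} S_b$ (within-class scatter inverse times between-class scatter). The function name `discriminant_eigenvectors` and the small `ridge` term that keeps $S_w$ invertible are also assumptions, and the data are synthetic stand-ins for vectorized face images.

```python
import numpy as np

def discriminant_eigenvectors(X, labels, m, ridge=1e-6):
    """Top-m eigenpairs of the assumed separation matrix Sw^{-1} Sb.
    X: (N, n) array, one vectorized example per row; labels: N class ids."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = X.shape[1]
    mean = X.mean(axis=0)
    Sw = np.zeros((n, n))  # within-class scatter
    Sb = np.zeros((n, n))  # between-class scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        d = (mc - mean)[:, None]
        Sb += len(Xc) * (d @ d.T)
    # Small ridge term (an assumption of this sketch) keeps Sw invertible.
    Sw += ridge * np.trace(Sw) / n * np.eye(n)
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-vals.real)[:m]  # indices of the m largest eigenvalues
    return vals.real[order], vecs.real[:, order]

# Synthetic data: 4 classes, 30 examples each, 6-dimensional inputs.
rng = np.random.default_rng(1)
n_classes, per_class, dim = 4, 30, 6
centers = rng.normal(0.0, 3.0, size=(n_classes, dim))
X = np.vstack([c + rng.normal(0.0, 1.0, size=(per_class, dim)) for c in centers])
labels = np.repeat(np.arange(n_classes), per_class)

eigvals, U = discriminant_eigenvectors(X, labels, m=3)
features = X @ U  # project each input vector onto the m discriminant directions
```

With four classes the between-class scatter has rank at most three, which is why `m=3` is the largest useful choice in this toy setting.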

The columns of $U^{(m)}$ are the eigenvectors corresponding to the $m$ largest
eigenvalues; they represent the directions along which the projections of the face
images within the database show the maximum class separation. As Figure 6.5
shows, unlike the KLT-based eigenfaces, the discriminant eigenvectors
do not typically have face-like patterns and are not directly related to
our intuitive notions of isolated features of human faces such as eyes, hair, chin,
etc.

Figure 6.5: Some of the top eigenpictures based on PCA (top) and
LDA (bottom).

Each face image in the database is represented, stored and tested in terms of
its projections onto the selected set of discriminant vectors, i.e. the directions
corresponding to the largest eigenvalues of $S^{(\Phi)}$:

$\forall u \in U^{(m)} : \phi_i^s(u) = \langle \phi_i^s, u \rangle$  (6.3.6)

$\Phi^{(m)} = \{\Phi^s(u) : \forall u \in U^{(m)},\ s = 1, \ldots, N_s\}$  (6.3.7)

Although all images of each subject are considered in the process of training,
only one of them needs to be saved, as a template for testing. If a view-based
approach is taken, one example has to be stored for each distinct view. Since
only the projection coefficients need to be saved, for each subject we retain the
example that is closest to the mean of the corresponding cluster in the feature
space. Storing the projection coefficients instead of the actual images is highly
desirable when large databases are used. Also, applying this holistic LDA to
multi-scale representations of face images, one can obtain multiscale
discriminant eigentemplates. For example, one can apply LDA to each component of the
WT of the face images and select the most discriminant eigentemplates obtained
from various scales. This approach is more complex because it requires the WT
computation of each test example, but in some applications it may be useful, for
example when the DP of the original representation is not captured in the first
few eigenvectors, or when the condition $m \le N_{classes} - 1$ becomes restrictive,
e.g. in gender classification.
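
The template selection rule (keep, for each subject, the example closest to its cluster mean in the feature space) might look like this in practice. The function name `select_templates` and the use of plain Euclidean distance for this step are assumptions of the sketch.

```python
import numpy as np

def select_templates(features, labels):
    """For each subject, keep the projected example closest (Euclidean)
    to the mean of that subject's cluster in the feature space."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    templates = {}
    for s in np.unique(labels):
        cluster = features[labels == s]
        dists = np.linalg.norm(cluster - cluster.mean(axis=0), axis=1)
        templates[s.item()] = cluster[np.argmin(dists)]
    return templates

# Two subjects, three projected examples each, in a 2-D feature space.
features = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0],
                     [10.0, 10.0], [10.0, 12.0], [10.0, 11.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
templates = select_templates(features, labels)
```

Only the returned coefficient vectors need to be stored, one per subject (or one per subject and view in a view-based setting).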

After  extracting  our  projection-based  discriminant  features,  we  apply  the 
multisource  decision  integration  scheme  of  Chapter  3.  In  the  process  of  decision 
integration  we  will  use  the  DP  of  each  decision  axis  resulting  from  a  projection 
as  a  measure  of  its  reliability.  Then,  for  each  presented  face,  we  apply  our 
simplified  distance  measure  of  equation  (3.3.30)  to  the  resulting  feature  vector 
to  obtain  a  sorted  list  of  the  top  candidates.  Figure  6.6  illustrates  distributions 
of  projection  coefficients  along  various  axes  for  a  four-class  case. 
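
To make the role of the DP weights concrete, the sketch below ranks stored templates by a DP-weighted squared distance. Equation (3.3.30) is defined in Chapter 3 and is not reproduced in this section, so the distance used here is a generic stand-in and should not be read as that equation; `rank_candidates`, the template names and the DP values are all hypothetical.

```python
import numpy as np

def rank_candidates(x, templates, dp, top_k=3):
    """Rank stored templates by a DP-weighted squared distance to the
    feature vector x (a stand-in for the distance of Chapter 3)."""
    w = np.asarray(dp, dtype=float)
    w = w / w.sum()  # normalize the per-axis reliabilities
    scores = {s: float(np.sum(w * (np.asarray(t, dtype=float) - x) ** 2))
              for s, t in templates.items()}
    return sorted(scores, key=scores.get)[:top_k]

templates = {"A": [0.0, 0.0], "B": [3.0, 0.0], "C": [0.0, 3.0]}
x = np.array([2.0, 2.0])
# The first axis is assumed to be nine times as reliable as the second.
ranking = rank_candidates(x, templates, dp=[9.0, 1.0])
```

With equal weights, templates B and C would be equidistant from `x`; weighting the more discriminatory axis more heavily is what breaks the tie in favor of B.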

6.4  Experiments  and  Results 

In our experiments, in order to satisfy the requirements mentioned above, we
used a mixture of two databases. We started with the database provided by
Olivetti Research Ltd. [75]. This database contains 10 different images of each
of 40 different subjects. All the images were taken against a homogeneous
background and some were taken at different times. The database includes
frontal views of upright faces with slight changes in illumination, facial
expression (open/closed eyes, smiling/non-smiling), facial details (glasses/no-glasses),
and some side movements. Originally we chose this database because it contains
many instances of frontal views for each subject. Then, to increase the size of the
database, we added some hand-segmented face images from the FERET database
[31]. We also included mirror-image and noisy versions of each face example in
order to expand the data set and improve the robustness of recognition
performance to image distortions. The total numbers of images used in training and
testing were about 1500 and 500 respectively. Each face was represented by a
50 × 60 pixel 8-bit gray-level image, which for our experiments was reduced to
25 × 30. The database was divided into two disjoint training and test sets. Using
this composite database we performed several tests on gender classification and
face recognition.

Figure 6.6: The distribution of projection coefficients along three
discriminant vectors with different levels of discriminatory power for
several poses from four different subjects.

Figure 6.7: Distribution of feature points for male and female
examples in the database. (Panel labels: Threshold; The Discriminant Eigentemplate.)

The first test was on gender classification using a subset of the database
containing multiple frontal views of 20 males and 20 females of different races.
The LDA was applied to the data and the most discriminant template was
extracted. Figure 6.7 shows this eigentemplate and the distribution of projection
coefficients for all images in the set. As Figure 6.7 shows, with only one feature
very good separation can be achieved. Classification tests on a disjoint test
set gave 95% accuracy. As mentioned above, one can also apply LDA to
wavelet transforms of face images, extract the most discriminant vectors of
each transform component, and combine multiscale classification results using
the proposed method of soft decision integration.
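
For a two-class problem such as gender classification, a single discriminant direction and a threshold suffice. The following sketch uses the classical two-class Fisher discriminant with a midpoint threshold on synthetic data; the threshold rule, the function name `fisher_direction_and_threshold` and the data are assumptions made for illustration and do not reproduce the reported experiment.

```python
import numpy as np

def fisher_direction_and_threshold(X0, X1):
    """Two-class Fisher discriminant: direction w and a midpoint
    threshold on the projection w.x (the threshold rule is assumed)."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    w = np.linalg.solve(Sw, m1 - m0)
    w /= np.linalg.norm(w)
    threshold = 0.5 * (m0 @ w + m1 @ w)  # midpoint of the projected class means
    return w, threshold

rng = np.random.default_rng(3)
X0 = rng.normal([0.0, 0.0, 0.0], 1.0, size=(100, 3))  # stand-in for class 0
X1 = rng.normal([4.0, 0.0, 0.0], 1.0, size=(100, 3))  # stand-in for class 1
w, threshold = fisher_direction_and_threshold(X0, X1)

# Classify every example by comparing its single projection with the threshold.
predicted_1 = np.concatenate([X0, X1]) @ w > threshold
accuracy = np.mean(predicted_1 == np.repeat([False, True], 100))
```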

Figure 6.8: A comparison of the DPs of the top 40 selected eigenvectors
based on PCA and LDA.

We then applied LDA to a database of 1500 faces, with 60 classes
corresponding to 60 individuals. Figure 6.8 shows the discriminatory power of the
top 40 eigenvectors chosen according to PCA and LDA. As Figure 6.8 shows,
the classification information of the principal components does not decrease
monotonically with their energy; in other words, there are many cases where
a low-energy component has a higher discriminatory power than a high-energy
component. The figure also shows that the top few discriminant vectors from
LDA contain almost all the classification information embedded in the original
image space.
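
The non-monotonic relation between energy and discriminatory power is easy to reproduce on synthetic data. In the sketch below the DP of an axis is taken to be a Fisher-type ratio of between-class to within-class variance of the projections; this definition, the helper `axis_dp` and the data are assumptions, since the DP measure used in this dissertation is defined in earlier chapters.

```python
import numpy as np

def axis_dp(z, labels):
    """Between-class over within-class variance of 1-D projections z
    (a Fisher-type ratio, used here as an assumed DP measure)."""
    classes = np.unique(labels)
    grand = z.mean()
    between = sum(np.sum(labels == c) * (z[labels == c].mean() - grand) ** 2
                  for c in classes)
    within = sum(np.sum((z[labels == c] - z[labels == c].mean()) ** 2)
                 for c in classes)
    return between / within

rng = np.random.default_rng(2)
n = 300
labels = np.repeat([0, 1], n)
offsets = np.where(labels == 0, -1.0, 1.0)
X = np.column_stack([
    rng.normal(0.0, 10.0, 2 * n),           # high energy, no class information
    rng.normal(0.0, 3.0, 2 * n),            # medium energy, no class information
    offsets + rng.normal(0.0, 0.3, 2 * n),  # low energy, carries the class difference
])

Xc = X - X.mean(axis=0)
energies, pcs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(-energies)               # principal components by decreasing energy
energies, pcs = energies[order], pcs[:, order]
dps = np.array([axis_dp(Xc @ pcs[:, k], labels) for k in range(3)])
```

Here the principal component with the least energy has by far the largest DP, so a selection based on energy alone would discard the most useful direction.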

Figure 6.9 shows the separation of clusters for ten poses of four different
individuals using the two most discriminatory eigenvectors or eigenpictures. As
Figure 6.9 indicates, the differences between classes (individuals) are emphasized
while the variations of the same face in different poses are de-emphasized. The
separation is achieved despite all the image variations resulting from the various
poses of each subject. Figure 6.10 shows the distribution of clusters, for 200
images of 10 subjects, in the best two-dimensional discriminant feature space and
in the best two-dimensional PCA-based space. Figure 6.11 shows the clusters
for 20 different subjects using the top two discriminant eigenvectors.

Figure 6.9: Separation of clusters in the selected 2-D feature space.
Four clusters correspond to variations of the faces of four different
subjects in the database.

Figure 6.10: Cluster separation in the best 2-D feature space, based
on LDA (top) and based on PCA (bottom).

Figure 6.11: Cluster separation in the best 2-D discriminant feature
space for 20 different subjects.

For each test face example, we first projected it onto the selected eigenvectors
and found the distance from the corresponding point in the 4-D feature space
to all of the previously saved instances. All distances were measured according
to equation (3.3.30) and the best match was selected. For the given database,
a recognition accuracy of 99.2% was achieved on the test set; see Table 6.1.

The simplicity of our system, the size of the database, and the robustness
of the results to small variations of the pose or noise show that our suggested
scheme is a good alternative approach to face recognition. It provides highly
competitive results at much lower complexity using low-dimensional feature
vectors.


                        No. of     No. of     Recogn. Rate     Recogn. Rate
Task                    Examples   Features   (Training Set)   (Test Set)
---------------------------------------------------------------------------
Face Recognition        2000       4          100%             99.2%
Gender Classification   400        1          100%             95%

Table 6.1: Summary of recognition rates.


6.5  Conclusions 

The  application  of  LDA  to  study  the  discriminatory  power  of  various  facial 
features  in  the  spatial  and  wavelet  domains  is  presented.  Also,  an  LDA-based 
feature  extraction  scheme  for  face  recognition  is  proposed  and  tested. 

A holistic projection-based approach to facial feature extraction is taken
where eigentemplates are the most discriminant vectors derived from the LDA
of face images in a rich enough database. The effectiveness of the proposed
LDA-based features is compared with that of PCA-based eigenfaces. For
classification a variation of evidential reasoning is used, in which each projection
becomes a source of discriminating information with reliability proportional to
its discrimination power. The weighted combination of similarity or
dissimilarity scores suggested by all projection coefficients is the basis for the membership
values.

Several results on face recognition and gender classification are presented,
in which highly competitive recognition accuracies are achieved with very small
numbers of features. The feature extraction can be applied to the WP
representations of the images to provide a multiscale discriminant framework. In such


Chapter 7
Conclusions 

The combination of the theories of wavelet-based multiresolution analysis and
discriminant analysis in statistical pattern recognition provides us with
powerful and flexible frameworks for extracting discriminant features for pattern
recognition/verification and segmentation systems.

Multi-scale features can be built in the process of selecting or linearly
combining waveforms from a library of local basis functions with the objective of
obtaining the largest class separability in the feature space. The original library or
dictionary of waveforms may be a combination of different classes of orthogonal
wavelets, Gabor functions and local trigonometric functions. For the case of
tree-structured orthogonal local bases such as wavelet packets and local
trigonometric functions there are fast search algorithms to find the most discriminatory
basis, whereas for redundant dictionaries only suboptimal greedy search
algorithms are available. The resulting selection of local waveforms may not be a
complete basis for the signal space, as is required in function approximation
problems.

In many classification/recognition based systems decisions are based on
multiple features or sources of information, where different features or sources have
different levels of reliability or impreciseness. The results of classification based
on incomplete, noisy or mixed data or sources of different reliabilities can be best
represented using soft decision vectors. Following the principle of least
commitment one needs to keep all intermediate results as soft decisions up to the last
step when a crisp decision may be needed. Soft decision boundaries and thereby
fuzzy partitions of the feature space are obtained using the nonlinear mapping
of multilayer neural networks or distance-based similarity/dissimilarity scores.

In some applications it is possible to take advantage of multiple observations
in a spatial/temporal neighborhood or context area in the final decision making.
We investigated soft decision integration using a consensus rule which is a
variation of a linear or logarithmic opinion pool with discrimination-based weighting
factors. Also, we enhanced the result by combining decisions in a context area
based on a relevance pattern. Decision integration can be implemented in a
probabilistic or evidential frame of reasoning.
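
As a reminder of what the two basic consensus rules compute, the sketch below implements the textbook linear and logarithmic opinion pools with per-source weights. It does not reproduce the specific variation developed in Chapter 3; the function names `linear_pool` and `log_pool` and the numerical weights, which stand in for discrimination-based reliabilities, are hypothetical.

```python
import numpy as np

def linear_pool(decisions, weights):
    """Weighted arithmetic mean of soft decision vectors (rows of decisions)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return w @ np.asarray(decisions, dtype=float)

def log_pool(decisions, weights, eps=1e-12):
    """Weighted geometric mean of soft decision vectors, renormalized to sum to one."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    pooled = np.exp(w @ np.log(np.asarray(decisions, dtype=float) + eps))
    return pooled / pooled.sum()

# Three sources give soft decisions over three classes; the weights are
# hypothetical stand-ins for discrimination-based reliabilities.
decisions = [[0.6, 0.3, 0.1],
             [0.5, 0.4, 0.1],
             [0.2, 0.3, 0.5]]
weights = [0.5, 0.4, 0.1]
p_lin = linear_pool(decisions, weights)
p_log = log_pool(decisions, weights)
```

Both pooled vectors remain soft decisions, so a crisp label (for instance the class with the largest pooled value) only has to be chosen at the final step, in keeping with the principle of least commitment.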

We explored these ideas by testing them on a variety of applications,
including

•  Recognition  of  Real  Aperture  Radar  returns 

•  Classification  and  segmentation  of  texture  images 

•  Layout-independent  segmentation  of  complex  document  pages 

•  Automatic  face  recognition 

Despite the many differences among these applications, we consistently obtained
promising results using fundamentally similar ideas. In some applications, such
as face recognition and document page segmentation, we presented a completely
new approach to the problem, while in other cases (e.g. texture and radar
classification) we obtained competitive results using systems of lower
complexity based on our alternative discriminant feature extraction and classification
methodology.

Future Work: In this dissertation we have explored some aspects of context-
dependent pattern recognition systems using a toolbox containing multi-scale
signal processing algorithms and multi-source decision integration
methodologies. We hope that the results described in this dissertation will serve as a
basis for further investigations. This work can be extended in different
directions from both analytical and implementation points of view.

In  the  context  of  multi-scale  discriminant  basis  selection  further  studies  can 
be  done  on  an  appropriate  choice  of  a  composite  and  redundant  dictionary  and 
on  developing/testing  various  greedy-type  separation  pursuit  algorithms.  Also, 
separation-based  local  basis  selection  may  have  applications  in  noise  suppression 
and  signal  enhancement  systems,  where  noise  is  defined  as  an  additional  class 
and  through  projection-based  methods,  the  system  tries  to  find  a  multi-scale 
representation  in  which  there  is  maximum  separation/difference  between  the 
signal  components  and  the  noise. 

Also,  more  extensive  research  on  discrimination  and  relevance-based  decision 
integration  using  more  sophisticated  and  efficient  consensus  rules  is  needed.  This 
may  involve  various  combinations  of  probabilistic,  evidential,  fuzzy  and  neural- 
based  approaches  in  decision  making. 

Future  work  can  also  address  new  applications.  One  can  test  our  proposed 
scheme  on  many  other  signal  and  image  processing  tasks,  such  as  recognition 
of  speech  and  acoustic  signals  or  analysis  of  aerial  or  medical  images.  Also,  in 
some  of  the  applications  that  we  have  explored  there  is  room  for  extensions  and 
enhancements.  For  example,  our  results  on  page  segmentation  can  be  linked  to 
higher  levels  of  knowledge-based  post-processing  for  document  understanding 
and/or  compression.  In  face  recognition,  our  work  can  be  extended  to  larger 
and  richer  databases,  covering  wider  variations  of  race,  age,  gender,  etc.  Such  a 
complete  database  can  also  be  used  for  more  detailed  analysis  of  facial  features 
that can be compared and linked with psychophysical findings.